21 August 2012

Fixing Hacker News: A mathematical approach

There is a certain phenomenon that seems to happen in almost every online community of user-generated content. A community is created: the initial users define the values of this new community. After a while the community experiences growth in numbers. As a result of that growth, users that joined before it feel like its no longer the same community with the same values. The latest widely discussed example seems to be Hacker News.

Paul Graham responds that the reason is mostly a shift in values and increase of anonymity:
It's a genuine problem and has been growing gradually worse for a while. I think the cause is simply growth. When a good community grows, it becomes worse in two ways: (a) more recent arrivals don't have as much of whatever quality distinguished the original members, and (b) the large size of the group makes people behave worse, because there is more anonymity in a larger group.
I've spent many hours over the past several years trying to understand and mitigate such problems. I've come up with a bunch of tweaks that worked, and I have hopes I'll be able to come up with more.

The idea I'm currently investigating, in case anyone is curious, is that votes rather than comments may be the easiest place to attack this problem. Although snarky comments themselves are the most obvious symptom, I suspect that voting is on average dumber than commenting, because it requires so much less work. So I'm going to try to see if it's possible to identify people who consistently upvote nasty comments and if so count their votes less.
As online communities grow, the values of the group shift. The majority now may or may not hold the same values as the majority before. The question is, how to preserve the old values of the group with minimum side-effects?

As it happens, my master's thesis was an attempt to fix exactly this problem mathematically and implement an improved voting system tailored specifically for communities with user-submitted content. I won't provide a link to the thesis as its not written in English, but I'll try to summarize the gist of it.

The voting system used in most communities today (democratic voting) is the one most susceptible to value shift when significant growth occurs. Its no surprise: democratic systems are designed to measure what the majority values. When significant growth occurs, the majority changes and therefore what they value also changes.

In contrast, previous moderator/editor based systems offer a strict filter on content based on the more static values of the current set of editors. However, it has the downside of being limited to what the editors are able to review and publish.

I propose a hybrid feedback-loop based system. In this system people have variable voting influence and editor-like individuals are given as a "reference point" or exemplary users with maximum voting influence. The system attempts to find out what they value and recognize it in others.

The system is based on the mathematics described in the beta reputation system, which is a system for measuring trust in online e-commerce communities.

Here is a short description of the system:
  • Voting influence is not the same for all users: its not 1 (+1 or -1) for everyone but in the range 0-1.
  • When a user votes for a content item, they also vote for the creator (or submitter) of the content.
  • The voting influence of a user is calculated using the positive and negative votes that he has received for his submissions.
  • Exemplary users always have a static maximum influence.
Suppose we have a content item \(C\) submitted by the user \(U_c\). Now a voter \(V\) comes to vote for it and clicks on the +1 button.

The voter has his own submissions for which he has received a total amount of positive vote \(p_V\) and a total amount of negative vote \(n_V\). As a result, his voting influence \(i_V\) is modified: its not +1 but calculated according to the formula:
$$ i_V = f_W(p_V, n_V) $$
where \(f_W\) is the lower bound of Wilson score confidence interval. While a simple average such as:
$$ i_V = \frac{p_V}{p_V + n_V} $$
might work when the number of positive and negative votes is large enough, its not good enough when the number of votes is low. The Wilson score confidence interval gives us a better, flexible balance between desired certainty in the result and the result itself.

This vote in turn is received by the content item \(C\). Because its a positive vote, the amount of positive vote \(p_C\) is changed for this content item
$$ p_C \leftarrow p_C + i_V $$
and as a result, it has a new rating
$$ r_c = f_W(p_c, n_c) $$
but the positive points \(p_U\) of the creator of the content item are also changed:
$$ p_U \leftarrow p_U + i_V $$
and as a result the voting influence \(i_U\) of submitter is also changed:
$$ i_U = f_W(p_U, n_U) $$
or in other words, he has "earned" a bigger influence in the voting system by submitting a well-rated content item.

This means that new members have no voting influence. As they submit content and receive votes their influence may rise if the existing users with high influence in the system consider their content to be good.

This is where the reference users \(R\) come in. Their influence is fixed to always be 1
$$ i_R = 1 $$
Because of this, influence propagates through the system from them to other users who submit content which is deemed high-quality by the reference users. Those users in turn also change influence by voting for others and so forth.

Its also possible to scale down votes as they age. The two possible strategies are to scale all \(p_X\) and \(n_X\) values daily, for all content items and all users by multiplying them with a certain aging factor \(k_a\)
$$ p_X \leftarrow k_a p_X $$
$$ n_X \leftarrow k_a n_X $$
or to simply keep all positive and negative votes \(V_p\) and \(V_n\) in the database and recalculate \(p_X\) and \(n_X\) according to the age of the votes \(a_V\), for example:
$$ p_X = \sum_{\forall V_p} { i_V k_a^{a_V} }$$
$$ n_X = \sum_{\forall V_n} { i_V k_a^{a_V} }$$
One of the convenient aspects of this system is that its easy to test-drive. It doesn't require more user action than simple democratic voting. It only requires an administrator to specify some reference users at the start which seed and then propagate influence throughout the system.

I tested this system on a forum dataset (details available on request) and found that the system achieves around 50% reduction of difference from a moderator only system compared to scores of a democratic system, even when the direct voting of reference users is turned off for content items and only the indirect (to other users) influence is counted. \((p < 0.05)\)

What does a 50% reduction in the difference mean? Let the score of a content item \(C\) be measured in 3 systems: democratic \(D\), reference-users-only \(R\) and hybrid \(H\) with direct influence of reference users to content items being turned off. By sorting the items according to those scores we can calculate their ranks in the 3 systems: \(r_D\), \(r_R\) and \(r_H\) respectively. The value of the rank is in the range \(1\) to \(n\), where \(n\) is total number of content items. The absolute difference between the democratic ranking and the reference ranking \(d_{DR}\) is:
$$ d_{DR} = abs(r_D - r_R) $$
while the absolute difference between the hybrid ranking and the reference ranking \(d_{HR}\) is:
$$ d_{HR} = abs(r_H - r_R) $$
and it turns out that on average,
$$ d_{HR} = 0.5 d_{DR} $$
The important downside of these results is that the people using the system were not aware that points are calculated in a different way. The original votes were given by people who knew that the system is democratic and acted accordingly. It remains to be seen what the results would be if people are aware that their voting influence depends on the way others vote for their submitted content.

I pondered starting a website similar to hacker news based on this voting and scoring scheme, however starting a whole new news website is about much more than just scoring algorithms (it requires reputation in the online comminty, popularity and most importantly time, none of which I presently have in sufficient amounts or know how to achieve). But hopefully, pg and the rest of the hacker news team might find this scheme useful enough to somehow incorporate it into the existing scoring system.

10 August 2012

Why native development sucks and HTML5 rocks: Porting to Windows 8

Lately, HTML5 mobile app development has received a lot of bashing all over the Internet. Most developers only quickly skim over the benefits and proceed to bash all the downsides. The conclusions they usually arrive at: HTML5 slow, inconsistent, limited and doesn't have a native look and feel. Therefore, native development is superior.

I'm not going to go point by point to defend or debunk these claims today. Instead, I'd like to give an example of the benefits of HTML5.

Here at CreationPal, we make SportyPal - a reasonably popular mobile app that tracks your workouts. Its a native non-trivial app implemented for multiple platforms, so we do have some experience with native technologies.

When we were discussing our new product DoxOut, we concluded that the best way to go is to use HTML5. As a result of this decision, it didn't take much time to port our prototype presentation app to the new Windows 8 UI (formerly codenamed Metro)

The time needed: 1 hour.

How is that possible?

To improve experience, we have a thin native layer for each platform. If a component of the layer is not available, a default HTML5 replacement is provided that works well enough in most browsers. The rest of the code and UI is pure HTML5.

The new Windows 8 UI is all about HTML5, CSS and JavaScript. An application has the same structure as a webpage, with a main html file. What is different is that there are extra "native" libraries available in the global JavaScript namespace which you can use to access native functionality. As a result, it was simply a matter of pointing the app to the relevant HTML, CSS and JavaScript.

Well, not quite.

We quickly got the following error:
JavaScript runtime error: Unable to add dynamic content. A script attempted to inject dynamic content, or elements previously modified dynamically, that might be unsafe. For example, using the innerHTML property to add script or malformed HTML will generate this exception. Use the toStaticHTML method to filter dynamic content, or explicitly create elements and attributes with a method such as createElement.  For more information, see http://go.microsoft.com/fwlink/?LinkID=247104.
The new Windows 8 UI has a strict html security model. Directly inserting potentially-unsafe HTML is not allowed, and since we are using a jQuery-based micro-templating library that does exactly that, we quickly ran into the error.

This was the offending line in the jQuery code

// ... snip ...
append: function () {
    return this.domManip(arguments, true, function (elem) {
        if (this.nodeType === 1) {
// ... snip ...

We learned quickly that a potential solution is to wrap unsafe execution using MSApp.execUnsafeLocalFunction :

append: function () {
    return this.domManip(arguments, true, function (elem) {
        if (this.nodeType === 1) {
            var self = this;
            // In Metro, execUnsafeLocalFunction is needed to
            // append a child which contains arbitrary innerHTML
            if (window["MSApp"]) MSApp.execUnsafeLocalFunction(function () {
            else this.appendChild(elem);

As this circumvents protection for all jQuery.fn.html calls,  its not an ideal solution. However the alternative involves giving up features such as data-attributes which was unacceptable.

The only remaining modification was to add the Microsoft-specific CSS properties (such as -ms-linear-gradient) to our LESS mixin classes and $.fn.css calls

Note: jQuery 1.8 makes the last practice obsolete: the newest $.fn.css automatically transforms the property to a browser-specific version, if needed.

After this modification the entire codebase worked flawlessly.

Now imagine the time it would take to port a native app to a new platform.

Despite all the negative reputation, HTML5 remains the fastest way to develop multi-platform apps. The quick porting of our DoxOut Presentation prototype is to Windows 8 confirms that (though not exactly a fair example as Metro is already heavily HTML5-based),  And with efforts such as AppJS, the number of OSes that support mixed HTML5-native development keeps growing.

Switch to HTML5, today!

Let it snow

I thought I'd post my javascript snow one-liner. Note that it doesn't work in firefox due to the new protection mechanisms. In Chrome it does, however you will need to type in the "javascript:" part manually because it automatically strips it off from the pasted text
  1. Open a website
  2. Delete the address from the address bar, replace it with "javascript:" (without the quotes)
  3. Add this and press enter:
    (function(){for(var c=[],d=Math.random,f=document.body.clientWidth-32,b=0;50>b;++b){c.push({c:document.createElement("div"),g:{b:0,a:d()*f,f:2,e:2,d:1}});c[b].c.innerHTML="*";var e=c[b].c.style;e.position="absolute";e["font-size"]=12+12*d()+"px";e.color="rgba(255,255,255,0.75)";e["text-shadow"]="0px 0px 5px #aaa";e.zIndex=65535;document.body.appendChild(c[b].c)}setInterval(function(){for(var a,b=0;b<c.length;++b)a=c[b].g,a.d=0.1>d()?!a.d:a.d,a.e=0+(1+2*d())*(a.d?1:-1),a.f=1+2*d(),a.b+=a.f,a.a+=a.e,

Intuitive JavaScript array filtering function part 2

Last time I wrote about my JavaScript array filtering function intuitiveFilter. It had one caveat: namely, the way it handles sub-arrays in the data. It doesn't allow the user to inspect a sub-array without writing a custom filtering function for the sub-array field.

To inspect an array field, one would need a filtering function which takes an array argument and returns true or false. For example, the function might check the contents of the array and see if the elements match certain rules. Depending on the number of elements matching the rules, it would return true or false.

We can already filter the contents of an array with intuitiveFilter. That means we can easily use it to inspect the elements of a sub-array. All we need now is to specify how many results we would like to have. The following rules would be useful:

  • there are exactly N elements satisfying the conditions
  • there are at least / at most N elements satisfying the conditions
  • none of the elements satisfy the condition
  • all of the elements satisfy the conditon

Now in order to implement this without modifying the original function, we can use a cool feature of functional languages: a higher order function which returns a function.

Why would we need to return a function?

We left only one extension mechanism for our filter: custom functions for fields. They take the field's value as an argument. They should return true/false depending on it. We called those basic filter functions.

Because we've been given the ability to return a function from another function, we could build a filter function from a rule object and then return it. Lets make a simple example

function has(rules) {
    return function(item) {
        return intuitiveFilter(item, rules).length > 0

What just happened here? We return a filter function which applies intiuitiveFilter to the array item and checks if it contains at least one element matching the rules. We've returned exactly what intuitiveFilter could use - a filter function that takes an array item and returns a boolean. It looks as if we wrote a shorter alias to replace some boilerplate code.

Remember the old way to write an array filter?

intuitiveFilter(arrayArray, {children: function(arr) { 
    return intuitiveFilter(arr, {age:{gt:10}}).length > 0; }

We can now write it like so:

intuitiveFilter(arrayArray, {children: has({age:{gt:10}})});
Isn't that beautiful? We've removed a lot of boilerplate code and the result is elegant and simple. Admittedly, it has a lot of braces and parentheses, but then again, so does Lisp. Now lets see if we can provide a richer set of filtering rules.

Lets start with some intuitive syntax ideas:

checkif(array, [ {has: filter,
    atLeast: 2, atMost:2}, ...]);

There are two possible interpretations for the usage of the array here. One of them would be the equivalent of an or operator. For example, [{has: {age: {gt:10}}, atLeast:1}, {has: {age: {lt: 8}}, atLeast: 1}] would mean the following: has at least one child older than 10 or has at least one child younger than 8. This is consistent with the meaning of arrays as they are used in intuitiveFilter. However, in this case, the or operator is a lot less useful to as than the and operator. Using the or operator on a single field is already possible through intuitiveFilter. Using the and operator isn't, even though that would be useful for array fields.

We're going to break consistency for the sake of completeness. The rule array argument of checkif will mean and instead of or, which means that all of the rules must be satisfied. We're going to have a slightly shaky abstraction this way, but its going to be a more useful one.

Finally, lets define some shorthand variants: checkif(array, {has: filter, atLeast:2}); - if we only need one rule, the argument can be the rule. checkif(array, {has: filter}); - default meaning is "atLeast: 1" checkif(array, {none: filter}); - shorthand for exactly: 0 checkif(array, {all: filter}); - all elements must satisfy the filter

And here is the code:

checkif = function(rules) {
    if (!$.isArray(rules)) { rules = [ rules ]; }
    for (var k = 0; k < rules.length; ++k) {
        if (rules[k].has && !("atLeast" in rules[k] 
                    || "atMost" in rules[k])) {
            rules[k].atLeast = 1;
    var checkLimits = function(filtered, rule) {
        return ((!("atMost" in rule)
                    || filtered <= rule.atMost)
                && (!("atLeast" in rule)
                    || filtered >= rule.atLeast)
                && (!("exactly" in rule)
                    || filtered == rule.exactly));
    var checkRule = function(filtered, total, rule) {
        return ((rule.has && checkLimits(filtered, rule))
                || (rule.none && !filtered)
                || (rule.all 
                    && filtered == total 
                    && checkLimits(filtered, rule)))
    return function(array) {
        for (var k = 0; k < rules.length; ++k) {
            if (!checkRule(intuitiveFilter(array, 
                    rules[k].has ? rules[k].has 
                    : rules[k].none ? rules[k].none
                    : rules[k].all).length, 
                array.length, rules[k])) return false;
        return true;

Some fun examples follow:

var testArray = [
    {name:"John",  age: 40, children: [{name:"Merlin", age:10}],
  company:{ name: "MegaCorp", employees: 200}},
    {name:"Sue",   age: 30, children: [{name:"Paco", age: 3}],
  company:{ name: "MegaCorp", employees: 200}},
    {name:"Mary", age: 55, children: [
        {name:"Joe", age: 17}, {name:"Moe", age:19}],
        company:{ name: "MegaCorp", employees: 200}},
    {name:"Jack",  age: 20, children: [],
 company:{ name: "MiniCorp", employees: 100}}
    {children: checkif({has: { age: { gt: 5}}, atLeast: 1})}));
    // John, Mary

    {children: checkif({none: { }})})); // Jack

    {children: checkif({all: { age: {gt: 12}}})})); // Jack and Mary

Note: "all" will always return true for empty arrays, as there are no items that don't satisfy the imposed conditions. This can be modified by adding atLeast: 1:

// Just Mary
    {children: checkif({all: { age: {gt: 12}}, atLeast: 1})}));

And now we've extended our filter language with rich syntax to deal with sub-arrays without touching the original filtering function. Wasn't that great?

Intuitive JavaScript array filtering function

When dealing with large JSON arrays on the client side or in node.js, one of our tasks might be to filter them on the client side before displaying them. Array.filter exists since JavaScript 1.6, however it seems kinda dry: all filters must be functions, even some really common filters such as matching text with a regular expression, checking if a number is within a range, checking if an enumeration has a certain value. Consider the following:
var testArray = [
        {name:"John",  age: 40, children: 2,
            company: { name: "MegaCorp", employees: 200}},
        {name:"Sue",   age: 30, children: 1,
            company:{ name: "MegaCorp", employees: 200}},
        {name:"Mary",  age: 55, children: 3,
            company:{ name: "MegaCorp", employees: 200}},
        {name:"Jack",  age: 20, children: 0,
            company:{ name: "MiniCorp", employees: 100}}];

// Using a for loop
var filtered = [];
for (var k = 0; k < testArray.length; ++k) {
    var item = testArray[k];
    if (item.age == 40 && item.age == 30) filtered.push(item); 

// Using Array.filter
testArray.filter(function(item) {
    return item.age == 40 || item.age == 30
}); // returns John

The Array.filter variant is around two times shorter than the for loop variant. It also looks much cleaner: the anonymous function is the filter which is called on each item to check if it should get through the filter or not. We can call Array.filter with various kinds of "filter functions". Should be good enough for all uses, right?

Not so, at least not when you have lots of filters and you want them to be as short as possible to write. Or if you want to combine multiple filter functions from various sources and the items in the data array are fairly complex.

Lets say that we have a complex filter function for the company object and a simple regex name filter from elsewhere and we need to combine them. We would have to write the following:

testArray.filter(function(item) { 
    return /^J.+$/.test(item.name)
        &&  complexFilter(item.company); });

However, now we can't easily replace complexFilter for the company with anotherComplexFilter. We have to write code - to write a different anonymous function and use it instead.

Now imagine having multiple different complexFilters. Soon enough you will write the following function

intiutiveFilterBeta = function(someArray, filters) {
    return someArray.filter(function(item) {
        for (var k = 0; k < filters.length; ++k) {
            if (!filters[k](item)) return false;
        return true;

which will enable you to combine different complex filters into a filter array, essentially implementing the and operator.

At about this point you will probably realize that you're missing the or operator. What if you wish to filter all companies which complexCompanyFilter1 or complexCompanyFilter2 ? If you're like me, right now you're probably working on a DSL (domain specific language) in your head, a DSL which reminds you of SQL. You might also start thinking that this is going a bit over the top.

However, if you look closely you will notice certain peculiarity about the and operator: you don't really need to use and on two or more filters which are working on the same field. For example, you might want to match 1 or 2 children, but never both 1 and 2 children - it just doesn't make sense. You might also want to have a "between" filter for age, but you wouldn't exactly want to and two between filters. Instead of between 30 and 50 and between 40 and 60 you would simply write a between 40 and 50 filter.

This observation seems to hold true for all primitive values except for strings. That doesn't really matter because we can easily filter strings with a tool made to do exactly that: regular expressions.

I decided to try and make a hopefully intuitive and readable yet still powerful filter function based on the observations above. It should enable some common primitive tests to be easily written without writing new functions. It should support the AND and OR operators intuitively and without writing functions in the most common cases. Finally, it should still enable writing custom filter functions. I came up with this:

intuitiveFilter = function (array, filter) {
    var itemFilter = function (iFilter, item) {
        if ($.isFunction(iFilter)) {
            return iFilter(item);
        else if ($.isArray(iFilter)) {
            for (var k = 0; k < iFilter.length; ++k) {
                if (itemFilter(iFilter[k], item)) return true;
            return false;
        else if (typeof(item) == 'string' && iFilter 
            && iFilter.test && iFilter.exec) { 
            return iFilter.test(item);
        else if (item === item + 0 && iFilter
            && (iFilter.lt || iFilter.gt || iFilter.le
            || iFilter.ge)) {
            // item is number and filter contains min-max
            return ((!("lt" in iFilter) || item <  iFilter.lt)
                &&  (!("gt" in iFilter) || item >  iFilter.gt)
                &&  (!("le" in iFilter) || item <= iFilter.le)
                &&  (!("ge" in iFilter) || item >= iFilter.ge));
        else if (typeof (iFilter) === "object") {
            for (var key in iFilter) {
                if (!itemFilter(iFilter[key], item[key]))
                    return false;
            return true;
        return (iFilter == item);
    var filtered = [];
    for (var k = 0; k < array.length; ++k) {
        if (itemFilter(filter, array[k])) 
    return filtered;

Note: it requires jQuery. Yes, I know that using the global scope like that is a bad idea.

And here are some neat ways to use it:

var testArray = [
        {name:"John",  age: 40, children: 2, 
            company:{ name: "MegaCorp", employees: 200}},
        {name:"Sue",   age: 30, children: 1,         
            company:{ name: "MegaCorp", employees: 200}},
        {name:"Mary",  age: 55, children: 3,        
            company:{ name: "MegaCorp", employees: 200}},
        {name:"Jack",  age: 20, children: 0, 
            company:{ name: "MiniCorp", employees: 100}}

    {name:/J.+/, age: {lt: 30}})); // Jack, but not John
    {age: [{gt: 15, le: 20}, {gt: 50}]})); // Jack and Mary
    {children: [0,1]})); // Jack, Sue

    {company: {name: "MegaCorp"}})) // all except Jack
    {age: function(a) { return a % 10 == 0 }})); // all except Mary
    [{age: 30 }, {company:{name:"MiniCorp"}}])); // Sue and Jack

The function is designed to make most filters look like a part of an item from the array that is being filtered. The examples demonstrate some possible uses.

In the first example-set, the first one is a classic and operator with a regex and a numeric operator for age. The second example showcases the simple numeric support. The third example is the purest form of the or operator on the number of children. Similar filters could easily be written for the string name with regular expressions, for example: {name:[/M.+/, /S.+/]}. Isn't that concise and lovely?

In the second set, the example {company: {name: "MegaCorp"}} showcases the ability of the filter to go deeper in the object. The second example shows the ability of the filter to use custom functions anywhere. The last example demonstrates the ability to use the or operator on filters which work on different fields.

The function would've been perfect if it wasn't for a caveat: it cannot check into arrays the same way it can check into an object. For example, if we had the following data:

var arrayArray = [{name:"John",  age: 40, 
    children: [{name:"Joe", age:12}, {name:"Jane", age:10}], 
    company:{ name: "MiniCorp", employees: 100}}]
we wouldn't have a way to test the contents of the sub-array children without writing a function:
intuitiveFilter(arrayArray, {children: function(arr) { 
    return childrenArrayFilter(arr, {age:{gt:10}}).length > 0; }

This caveat isn't hard to fix. However, I've decided that I'll leave it unfixed for now: let the fix be an exercise for the reader. If this code generates some interest I will supply my fix later. The fix can even be added without modifying the original function.