How Facebook privacy failed me

At some point, I put extra email addresses on Facebook because I thought it was necessary for something, but I didn’t want to show them, so in the privacy settings I marked their visibility as “Only Me.” It turns out that right now, Facebook is blatantly ignoring that privacy setting and showing them to the world.

Here are my settings:

Here is a fragment of my profile, viewed from a friend’s account half an hour ago:

I would complain to their customer service, but I can’t find a link from their Help Center page.

Obviously, this particular issue is a very minor concern for me. But it hardly instills faith in the system — especially considering that privacy bugs are ones that the affected user, almost by definition, can’t see. I’ve also had other weird issues where changes to privacy settings don’t seem to stick when I save and then later go back to the page. These things are annoying and hard to verify — which is why important “social utilities” like Facebook have to be 99.99999% bug-free on things like this in order to deserve user trust.

Update #1: Only by talking to a Facebook employee did I learn that if you search for “bugs” on the Help Center page, you get to the bug report form, and that they place a high priority on privacy bugs. I guess this makes me feel a bit better. I’d be interested in knowing the incidence rate of privacy bugs like this. How often does people’s information get revealed without them knowing it? Does the general public of Facebook users have any way of knowing? This seems like territory a third-party consumer or privacy advocacy organization should work on (e.g. Consumers Union or the EFF or something).

Update #2: The bug seems to have disappeared from my friend’s view of my profile, about 30 minutes after I reported it. They fixed it fast. Great!

But still, I’m not going to trust Facebook’s privacy settings again unless there is a credible argument, and independent verification, that they actually implement what they promise. Even a low error rate is still bad, and verification is hard (a rare-event detection problem), so I’m not sure what would count as a convincing demonstration that privacy settings actually work…

Update #3: Apparently this was a brief but massive and wide-reaching bug. Maureen reports.

6 Comments

List of probabilistic model mini-language toolkits

There is an increasing number of systems that let the user specify a probabilistic model in a high-level language — for example, declare a (Bayesian) generative model as a hierarchy of various distributions — and then automatically run training and inference algorithms on a data set. You could always learn a good math library and implement every model from scratch, but the motivation for this approach is that you avoid lots of repetitive and error-prone programming. I’m not yet convinced that any of them completely achieve this goal, but it would be great if they succeeded and we could use high-level frameworks for everything.
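
To make the contrast concrete, here is a minimal sketch of what the “from scratch” route looks like: a hand-rolled Gibbs sampler, in Python/NumPy, for about the simplest model that needs one (normal data with unknown mean and precision under conjugate priors). It’s my own illustration, not code from any of the toolkits below; they aim to generate this kind of machinery for you from a model declaration.

```python
import numpy as np

# Hand-rolled Gibbs sampler for:  y_i ~ Normal(mu, 1/tau),
# with conjugate priors  mu ~ Normal(mu0, 1/prec0),  tau ~ Gamma(a0, b0).
# Every new model means re-deriving conditionals and rewriting this loop,
# which is exactly the repetitive work these toolkits try to eliminate.

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=200)   # fake data
n, ybar = len(y), y.mean()

mu0, prec0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3     # vague priors
mu, tau = 0.0, 1.0                             # initial state
samples = []

for it in range(5000):
    # Conditional for mu given tau (Normal)
    prec_n = prec0 + n * tau
    mean_n = (prec0 * mu0 + tau * n * ybar) / prec_n
    mu = rng.normal(mean_n, 1.0 / np.sqrt(prec_n))
    # Conditional for tau given mu (Gamma; NumPy takes shape and scale = 1/rate)
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * np.sum((y - mu) ** 2)
    tau = rng.gamma(a_n, 1.0 / b_n)
    if it >= 1000:                             # discard burn-in
        samples.append((mu, tau))

mus = np.array([m for m, _ in samples])
print("posterior mean of mu ~", mus.mean())
```

In a BUGS-style system you would instead write down only the three distribution statements and let the toolkit derive and run the sampler.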

Everyone seems to know about only a few of them, so here’s a meager attempt to list a bunch that can be freely downloaded. One package is far more mature and has been around much longer than the rest, so let’s start with it:

  • BUGS – Bayesian inference Using Gibbs Sampling. Specify a generative model, then it does inference with a Gibbs sampler, which lets it handle a wide variety of models. The classic version has extensive GUI diagnostics for convergence and the like. BUGS can also be used from R. (The model definition language itself is R-like but not actually R.)

    BUGS has had many users from a variety of fields. There are many books and zillions of courses and other resources showing how to use it to do Bayesian statistical data analysis. BUGS is supposed to be too slow once you get to thousands of parameters. The original implementation, WinBUGS, is written in Delphi, a variant of Pascal (!); its first release was in 1996. There are also two alternative open-source implementations (OpenBUGS, JAGS).

    This is clearly very mature and successful software. Any attempt to build something new should be compared against BUGS.

Next are systems that are much newer, generally less than a few years old. Their languages all fall broadly into the category of probabilistic graphical models, but there are plenty of differences, specializations, and assumptions that are a project in themselves to understand. In lieu of a real synthesis, I’ll just list them with brief explanations.

  • Factorie focuses on factor graphs and discriminative undirected models. Claims to scale to millions of parameters. Written in Scala. New as of 2009. Its introductory paper is interesting. From Andrew McCallum’s group at UMass Amherst.

  • Infer.NET. I only just learned of it. New as of 2008. Focuses on message-passing inference. Written in C#. From MSR Cambridge. I actually can’t tell whether you get its source code in the download. All other systems here are clearly open source (except WinBUGS, but OpenBUGS is a real alternative).

  • Church. Very new (as of 2009?), without much written about it yet. Focuses on generative models. Seems small/limited compared to the first three. Written in Scheme. From MIT.

  • PMTK – Probabilistic Modeling Toolkit. I actually have no idea whether it does model specification-driven inference, but the author’s previous similar-looking toolkit (BNT) is fairly well-known, so it’s in this list. Written in Matlab. From Kevin Murphy.

  • HBC – Hierarchical Bayesian Compiler. A similar idea to BUGS, though see the webpage for a clear statement of its somewhat different goals. It compiles the Gibbs sampler to C, so it’s much faster. Seems to be unmaintained. Written in Haskell. From Hal Daume.

Finally, there are a few systems that seem to be more specialized. I certainly haven’t listed all of them; see the Factorie paper for a list of a few others.

  • Alchemy – an implementation of the Markov Logic Network formalism, an undirected graphical model over log-linear-weighted first-order logic. So, unlike BUGS and the above systems, there are no customized probability distributions for anything; everything is a Boltzmann (log-linear) distribution (written out just after this list). At least, that’s how I understood it from the original paper. The FOL is essentially a language the user uses to define log-linear features. Alchemy then runs training algorithms to fit their weights to data.

    From Pedro Domingos’ group at UWashington. Written in C++. I’ve heard people complain that Alchemy is too slow. But in fairness, all these systems are slower than customized implementations.

  • Dyna is specialized for dynamic programming. The formalism is weighted Horn clauses (weighted Prolog). It implements agenda-based training/inference algorithms that generalize Baum-Welch, PCFG chart parsers, and the like. Written in C++, and compiles to C++. Seems unmaintained. From Jason Eisner’s group at Johns Hopkins.

    Since it only does dynamic programs, Dyna supports a much more limited set of models than the above systems. But I expect that means it can train and do inference with models that the above systems would be hopeless at handling, since dynamic programming gives you big-O efficiency gains over more general algorithms. (On the other hand, even dynamic programming can be too generic and slow compared to direct, customized implementations. That’s the danger of all these systems, of course.)

  • BLOG – first-order logic with probability, though a fairly different formalism than MLNs. Focuses on problems with unknown objects and unknown numbers of objects. I personally don’t understand the use case very well. Its name stands for “Bayesian logic,” which seems like an unfairly broad characterization given all the other work in this area. From Brian Milch. Seems unmaintained? Written in Java.
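
For reference, the log-linear form behind MLNs, as I understand it from the Alchemy paper: each first-order formula F_i has a weight w_i, and n_i(x) counts its true groundings in a possible world x.

```latex
P(X = x) \;=\; \frac{1}{Z}\,\exp\!\Big(\sum_i w_i\, n_i(x)\Big),
\qquad
Z \;=\; \sum_{x'} \exp\!\Big(\sum_i w_i\, n_i(x')\Big)
```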

An interesting axis of variation of all these is whether the model specification language is Turing-complete or not, and to what extent training and inference can be combined with external code.

  • Turing-complete: Factorie, Infer.NET, Church, and Dyna are all Turing-complete. The modeling languages of the first three are embedded in general-purpose programming languages (Scala, C#, and Scheme respectively). Dyna is Turing-complete in two different ways: it has a full Prolog-ish engine, which is technically Turing-complete but a total pain to do anything non-Prolog-y in; and it also compiles to C++.
  • Not Turing-complete: BUGS, HBC, Alchemy/MLN, and BLOG use specialized mini-languages. BUGS’ and HBC’s languages are essentially the same as standard probabilistic model notation, though BUGS is imperative. Alchemy and BLOG are logic variants.
  • Compiles to Turing-complete: HBC compiles to C, and Dyna compiles to C++, which are then intended to be hacked up and/or embedded in larger programs. I imagine this is a maintainability nightmare, but could be fine for one-off projects.

Another interesting variation is to what extent the systems handle probabilistic relations. BUGS and HBC don’t really try at all beyond plates; Alchemy, BLOG, and Factorie basically specialize in this; Dyna kind of does in a way; and the rest I can’t tell.

In summary, lots of interesting variation here. Given how many of these things are new and changing, this area will probably look much different in a few years.

6 Comments

Seeing how “art” and “pharmaceuticals” are linguistically similar in web text

Earlier this week I asked the question,

How are “art” and “pharmaceuticals” similar?

People sent me lots of submissions! Some are great, some are a bit of a stretch.

  • Overpriced by an order of magnitude.
  • The letters of “art” are found embedded, in order, in “pharmaceuticals”.
  • Search keywords that cost the most to advertise on?
  • “Wyeth”: I think this means this, and this.
  • “Romeo and Juliet” famously includes both “art” (wherefore art thou) and pharmaceuticals (poison!)
  • Some art has been created out of pharmaceuticals.
  • Some art has been created under the influence of pharmaceuticals.

I was asking because I was playing around with a dataset of 100,000 noun phrases’ appearances on the web, from the Reading the Web project at CMU. That is, for a noun like “art”, this data has a large list of phrases in which the word “art” is used, across some 200 million web pages. For two noun concepts, we can see what they have in common and what’s different by looking at examples of how people use them when writing. So, for “art” versus “pharmaceuticals”:

| common contexts for “art” but not “pharmaceuticals” [7394 total] | common contexts for both “art” and “pharmaceuticals” [165 total] | common contexts for “pharmaceuticals” but not “art” [206 total] |
| --- | --- | --- |
| ‘m into _ | areas such as _ | a greater amount of _ |
| ‘s interested in _ | prices of _ | standards for _ |
| A collection of _ | storage of _ | marketer of _ |
| _ has been described by | producers of _ | market for _ |
| structure of _ | _ designed for | prescriptions for _ |
| study in _ | the provision of _ | the supply of _ |
| _ have been shown in | _ sold in | the availability of _ |
| The knowledge of _ | the same way as _ | advertising for _ |
| _ is a commodity | _ are among | the appropriate use of _ |
| _ is a creation | The production of _ | shipment of _ |
| _ is a world | the analysis of _ | a cocktail of _ |
| an exhibition of _ | advances in _ | classes of _ |
| the commercialization of _ | specialising in _ | a complete inventory of _ |
| the confinement of _ | a career in _ | _ related downloads |
| _ is cast in | _ stolen from | new generations of _ |

The middle column, showing ways in which people talk about both “art” and “pharmaceuticals”, makes it pretty clear. What they have in common is that they’re both products: you can buy, sell, produce, and store them. (There’s also an intellectual goods aspect: they both can be stolen.) This really didn’t occur to me at first; silly me, I thought art was a thing of beauty removed from such mundane considerations. A number of the submitted answers, though, center around the theme of them both being expensive — so we have positive agreement between corpus statistics and human judgments!

Examining massive numbers of contexts like this follows what the infinitely wise Dinosaur Comics calls “a statistically-based descriptivist approach to semantics.” Or as linguist J.R. Firth put it, “You shall know a word by the company it keeps.” Many subtleties of the two concepts can be seen just in their context lists. For example, in the left column, we see that only art “is a commodity”. Well, certainly pharmaceuticals are a commodity too. But that’s so obvious it’s not worth saying. Proclaiming that “art is a commodity”, however, is interesting. Maybe we think about this (possible) fact more.

As for the data: it comes from 200 million web pages (500 million sentences), and is filtered to contexts that appear more than five hundred times in the data. It was collected as part of a research project that seeks to extract a database of knowledge from this information — “reading the web”. (Yes, Hadoop was involved.) To make the table, I took the contexts’ set differences and intersection and showed a random subsample from each.
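
As a rough sketch of that bookkeeping (my own reconstruction, not the project’s code), here is how the set operations and subsampling could look in Python, assuming a hypothetical tab-separated file of (noun, context) pairs extracted from the corpus:

```python
import random
from collections import defaultdict

# Assumes a hypothetical file "noun_contexts.tsv" with lines like
#   art<TAB>an exhibition of _
# i.e. one (noun, context) pair per line, already filtered by frequency.

contexts = defaultdict(set)
with open("noun_contexts.tsv") as f:
    for line in f:
        noun, ctx = line.rstrip("\n").split("\t")
        contexts[noun].add(ctx)

art, pharma = contexts["art"], contexts["pharmaceuticals"]
columns = {
    "art only": art - pharma,        # set difference
    "both": art & pharma,            # intersection
    "pharma only": pharma - art,
}
for label, ctxs in columns.items():
    sample = random.sample(sorted(ctxs), min(15, len(ctxs)))
    print(label, len(ctxs), sample)
```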

A final note. Will pointed out that in Alice in Wonderland, the Mad Hatter asks, “Why is a raven like a writing desk?” I tried that query on this data, but unfortunately, it didn’t contain many instances of “raven”. However, it does include a proper name “Raven” — which turns out to be an anime character. Not the first time I’ve seen the Internet’s massive amount of anime knowledge get in the way of a very serious semantic extraction system!

Many thanks to Adam, Joanna, Will, Vikas, and Michael for the submitted answers.

3 Comments

Quiz: “art” and “pharmaceuticals”

A lexical semantics question:

How are “art” and “pharmaceuticals” similar?

I have a data-driven answer, but am curious how easy it is to guess it, and in what sense it’s valid. I’ll post my answer and supporting evidence on Tuesday.

1 Comment

Don’t MAWK AWK – the fastest and most elegant big data munging language!

update 2012-10-25: I’ve been informed there is a new maintainer for Mawk, who has probably fixed the bugs I’ve been seeing.

From: Gert Hulselmans

[The bugs you have found are] indeed true with mawk v1.3.3 which comes standard with Debian/Ubuntu. This version is almost not developed the last 10 years.

I now already use mawk v1.3.4 maintained by another developer (Thomas E. Dickey)
for more than a year on huge datafiles (sometimes several GB).

The problems/wrong results I had with mawk v1.3.3 sometimes are gone. In his version, normally all open/known bugs are fixed.

This version can be downloaded from: http://invisible-island.net/mawk/


update 2010-04-30: I have since found large datasets where mawk is buggy and gives the wrong result. nawk seems safe.


When one of these newfangled “Big Data” sets comes your way, the very first thing you have to do is data munging: shuffling around file formats, renaming fields and the like. Once you’re dealing with hundreds of megabytes of data, even simple operations can take plenty of time.
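
As a toy illustration of the kind of reformatting involved (this is not the script that was benchmarked below, just a hypothetical example): turning a sparse feature file with lines like “docid feat:val feat:val …” into flat “docid,feature,value” triplets that R or Matlab can load.

```python
import sys

# Stream stdin to stdout, one output record per feature, so memory use
# stays flat no matter how big the input file is.
for line in sys.stdin:
    parts = line.split()
    if not parts:
        continue
    doc_id, feats = parts[0], parts[1:]
    for feat in feats:
        name, _, val = feat.partition(":")
        sys.stdout.write(f"{doc_id},{name},{val or '1'}\n")
```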

For one recent ad-hoc task I had — reformatting 1GB of textual feature data into a form Matlab and R can read — I tried writing implementations in several languages, with help from my classmate Elijah. The results really surprised us:

| Language | Time (min:sec) | Speed (vs. gawk) | Lines of code | Notes | Type |
| --- | --- | --- | --- | --- | --- |
| mawk | 1:06 | 7.8x | 3 | Mike Brennan’s Awk, system default on Ubuntu/Debian Linux | VM |
| java | 1:20 | 6.4x | 32 | version 1.6 (-server didn’t matter) | VM+JIT |
| c-ish c++ | 1:35 | 5.4x | 42 | g++ 4.0.1 with -O3, using stdio.h | Native |
| python | 2:15 | 3.8x | 20 | version 2.5, system default on OSX 10.5 | VM |
| perl | 3:00 | 2.9x | 17 | version 5.8.8, system default on OSX 10.5 | VM |
| nawk | 6:10 | 1.4x | 3 | Brian Kernighan’s “One True Awk”, system default on OSX, *BSD | ? |
| c++ | 6:50 | 1.3x | 48 | g++ 4.0.1 with -O3, using fstream, stringstream | Native |
| ruby | 7:30 | 1.1x | 22 | version 1.8.4, system default on OSX 10.5; also tried 1.9, but it was slower | Interpreted |
| gawk | 8:35 | 1x | 3 | GNU Awk, system default on RedHat/Fedora Linux | Interpreted |


48 Comments

Patches to Rainbow, the old text classifier that won’t go away

I’ve been reading several somewhat recent finance papers (Antweiler and Frank 2005, Das and Chen 2007) that use Rainbow, the text classification software originally written by Andrew McCallum back in 1996. The last version is from 2002 and the homepage announces he isn’t really supporting it any more.

However, as far as I can tell, it might still be the easiest-to-use text classification package out there. You don’t have to program — just invoke command-line arguments — and it can accommodate reasonably sized datasets, does tokenization, stopword filtering, etc. for you, and has some useful feature selection and other options. Based on my limited usage, it seems well-implemented. If anyone knows of a better one, I’d love to hear about it. I once looked at, among other things, GATE and UIMA, and they seemed too hard to use if you just wanted to download something that did simple text classification; or else, maybe they didn’t have documentation on how to use them that way. Rainbow does. If I had to recommend a text classifier to a social scientist today, I might say they should use Rainbow.

(GATE and UIMA call themselves “architectures”. I usually don’t want an architecture; I want a program that does stuff. LingPipe was the only other system I found that had good web documentation saying how to use it to do text classification. It looks like a good option if you’re willing to write some code. There are numerous academic efforts to make automated content analysis systems that at a high level sound like the right sort of thing, but nearly all of them have poor web docs, so it’s hard to tell whether they do what you want.)

In the meantime, the current Rainbow download has issues compiling on modern GCC and Mac OSX — some of the issues are documented here. I worked through them and put my patched version (only tested on GCC 4.0, OSX 10.5) up here: github.com/brendano/bow

5 Comments

Another R flashmob today

Dan Goldstein sends word they’re doing another Stackoverflow R flashmob today. It’s a neat trick. The R tag there is becoming pretty useful.

Leave a comment

Beautiful Data book chapter

Today I received my copy of Beautiful Data, a just-released anthology of articles about, well, working with data.  Lukas and I contributed a chapter on analyzing social perceptions in web data.  See it here. After a long process of drafting, proofreading, re-drafting, and bothering the publishers under rather sudden deadlines, I’ve resolved to never use graphics again in anything I write :)

Here’s our final figure, a k-means clustering of face photos via perceived social attributes (social concepts/types? with exemplars?):

I just started reading the rest of the book and it’s very fun.  Peter Norvig’s chapter on language models is gripping.  (It does word segmentation, ciphers, and more, in that lovely python-centric tutorial style extending his previous spell correction article.)  There are also chapters by many other great researchers and practitioners (some of whom you may have seen around this blog or its neighborhood) like Andrew Gelman, Hadley Wickham, Michal Migurski, Jeffrey Heer, and still more… I’m impressed just by the talent-gathering-and-organizing operation. Big kudos to editors Toby Segaran and Jeff Hammerbacher, and O’Reilly’s Julie Steele.

I also have an apparently secret code that gets you a discount, so email me if you want it.  I wonder if I’m not supposed to give out many of them.  Hm.

4 Comments

Haghighi and Klein (2009): Simple Coreference Resolution with Rich Syntactic and Semantic Features

I haven’t done a paper review on this blog for a while, so here we go.

Coreference resolution is an interesting NLP problem.  (Examples.)  It involves honest-to-goodness syntactic, semantic, and discourse phenomena, but still seems like a real cognitive task that humans have to solve when reading text [1].  I haven’t read the whole literature, but I’ve always been puzzled by the crop of papers on it I’ve seen in the last year or two.  There’s a big focus on fancy graph/probabilistic/constrained optimization algorithms, but often these papers gloss over the linguistic features — the core information they actually make their decisions with [2].  I never understood why the latter isn’t treated as the most important issue.  Therefore, it was a joy to read this paper.

They describe a simple, essentially non-statistical system that outperforms previous unsupervised systems, and compares favorably to supervised ones, by using smart features.  It has two-ish modular components:

  • Syntactic constraints: entity type agreement, appositives, and a few other things that get at syntactic salience.
  • Semantic filter:  non-pronoun mentions must be corpus-pattern-compatible with their antecedents; described below.

For each mention, these constraints filter previous mentions to several possible antecedents.  If there’s more than one, the system picks the closest.  Entity clusters are formed in the simplest (dumbest) way possible, by taking the transitive closure of these pairwise mention-antecedent decisions.
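
Here is a toy sketch of that decision procedure as I read it (my own reconstruction in Python, not their code); the `compatible` predicate stands in for their syntactic and semantic filters:

```python
# For each mention (in document order), link it to the closest earlier
# mention that passes the filters, then take the transitive closure of
# those links (via union-find) to form entity clusters.

def cluster_mentions(mentions, compatible):
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i, mention in enumerate(mentions):
        candidates = [j for j in range(i) if compatible(mention, mentions[j])]
        if candidates:
            union(i, max(candidates))       # closest = most recent antecedent

    clusters = {}
    for i in range(len(mentions)):
        clusters.setdefault(find(i), []).append(mentions[i])
    return list(clusters.values())

# e.g. cluster_mentions(["AOL", "the company", "it"], lambda m, a: True)
# links everything into one entity.
```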

The lexical semantic filter is interesting.  They found that syntactic cues have recall issues for non-pronoun references, e.g. the company referring to AOL.  You need to know that these two words tend to be compatible with each other.  They create a very specific lexical resource — of word pairs compatible to be coreferent — by finding coreferent expressions via bootstrapping in a large unlabelled text corpus (Wikipedia abstracts and newswire articles.  But they say only 25k Wiki abstracts?  There are >10 million total; how were they selected?).

Using a parsed corpus, they seeded with appositive and predicate-nominative patterns: I’m guessing something like “Al Gore, vice-president…” and “Al Gore was vice-president”.  Then they extracted connecting paths on those pairs.  E.g., the text “Al Gore served as the vice-president” then yields the path-pattern “X served as Y”.  Then there’s one more iteration to extract more word pairs — pairs that appear a large number of times.
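
A rough sketch of that bootstrapping loop as I understand it (my reconstruction; the paper works over parse trees, while this pretends we already have a hypothetical list of (np1, path, np2) triples extracted from the corpus):

```python
from collections import Counter

def bootstrap(triples, seed_pairs, min_count=100):
    # 1. Collect the connecting paths that frequently link seed pairs
    #    (the pairs found by the appositive / predicate-nominative patterns).
    path_counts = Counter(path for np1, path, np2 in triples
                          if (np1, np2) in seed_pairs)
    good_paths = {p for p, c in path_counts.items() if c >= min_count}

    # 2. Apply those paths to harvest new head-word pairs, keeping only
    #    pairs that appear a large number of times.
    pair_counts = Counter((np1, np2) for np1, path, np2 in triples
                          if path in good_paths)
    return {pair for pair, c in pair_counts.items() if c >= min_count}

# seed_pairs might look like {("Al Gore", "vice-president")}, and a triple
# like ("Al Gore", "X served as Y", "vice-president") would support the
# path "X served as Y".
```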

They cite Hearst 1992, Snow et al. 2005, and Phillips and Riloff 2007.  But note, they don’t describe their technique as trying to be a general hypernym finder; rather, its only goal is to find pairs of words (noun phrase heads) that might corefer in text that the final system encounters.  In fact, they describe the bootstrapping system as merely trying to find instances of noun phrase pairs that exhibit coreference.  I wonder if it’s fair to think of the bootstrapping system as a miniature coreference detector itself, but tuned for high precision by only considering very short syntactic paths (no longer than 1 sentence).  I also wonder if there are instances of non-pronoun coreference that aren’t hyponym-hypernym pairs; if not, my analysis here is silly.

Coverage seems to be good, or at least useful: many non-pronoun coreference recall errors are solved by using this data.  (I can’t tell whether it’s two-thirds of all recall errors after the syntactic system, or two-thirds of the errors in Table 1.)  And they claim word pairs are usually correct, with a few interesting types of errors (that look quite solvable).  As for coverage, I wonder if they tried WordNet hypernym/synonym information, and whether it was useful.  My bet is that WordNet’s coverage here would be significantly poorer than a bootstrapped system.

This paper was fun to read because it’s written very differently than the usual NLP paper.  Instead of presenting a slew of modelling, it cuts to the chase, using very simple algorithms and clear argumentation to illustrate why a particular set of approaches is effective.  There’s lots of error analysis motivating design decisions, as well as suggesting conclusions for future work.  In particular, they think discourse, pragmatics, and salience aren’t the most important issues; instead, better syntactic and semantic modelling would give the biggest gains.

There’s also something very nice about reading a paper that doesn’t have a single equation yet makes a point, and is easy to implement yourself to boot.  I think the machine learning approach to NLP research can really hurt insight.  Every paper is obsessed with held-out predictive accuracy.  If you’re lucky, a paper will list out all the features they used, then (only sometimes!) they make a cursory attempt at finding which features were important.  A simple hand-coded system lends itself to easily describing and motivating every feature by itself — better narrative explanations and insight.  Which type of research is more useful as science?

Final note: it’s not totally fair to consider this system a non-statistical one, because its syntactic and semantic subsystems rest on complicated statistical components that required boatloads of labelled training data — the Stanford parser, Stanford NER, and the Charniak parser.  (I wonder how sensitive performance is to these components.  Could rule-based parsing and NER work as well?)  Further, as they point out, more complicated structured approaches to the problem of forming entity partitions from features should improve performance.  (But by how much?)

[1] As opposed to, say, the rarefied activity of treebanking, a linguistic behavior complex enough to need a 318-page specification, which maybe several dozen people on Earth are capable of executing.  There’s a whole other rant here, on the topic of the behavioral reality of various linguistic constructs.  (Treebank parsers were extensively used in this paper, so maybe I shouldn’t hate too much…)

[2] There are certainly exceptions to this, like Bengtson and Roth, or maybe Denis and Baldridge (both EMNLP 2008).  I should emphasize that my impression of the literature is from a small subsample.

2 Comments

Blogger to WordPress migration helper

A while ago I moved my blog from Blogger (socialscienceplusplus.blogspot.com) to a custom WordPress installation here (anyall.org/blog).  WordPress has a nice Blogger import feature, but I also wanted all the old URLs to redirect to their new equivalents.  This is tricky because Blogger doesn’t give you much control over their system.  I only found pretty hacky solutions online, so I wrote a new one that’s slightly better, and posted it here if anyone’s interested: gist.github.com/15594

Leave a comment