Reading the Gradient: ML won’t solve Natural Language Understanding

After having received quite a few messages via The Gradient mailing list, I finally got around to reading them, and Machine Learning Won’t Solve Natural Language Understanding was a very well-argued post about why the current focus on Deep-Learning-based techniques for NLU is misguided.

Some snippets:

This misguided trend has resulted, in our opinion, in an unfortunate state of affairs: an insistence on building NLP systems using ‘large language models’ (LLM) that require massive computing power in a futile attempt at trying to approximate the infinite object we call natural language by trying to memorize massive amounts of data. In our opinion this pseudo-scientific method is not only a waste of time and resources, but it is corrupting a generation of young scientists by luring them into thinking that language is just data – a path that will only lead to disappointments and, worse yet, to hampering any real progress in natural language understanding (NLU). Instead, we argue that it is time to re-think our approach to NLU work since we are convinced that the ‘big data’ approach to NLU is not only psychologically, cognitively, and even computationally implausible, but, and as we will show here, this blind data-driven approach to NLU is also theoretically and technically flawed.

[N]atural language understanding by machines is difficult because of MTP – that is, because our ordinary spoken language in everyday discourse is highly (if not optimally) compressed, and thus the challenge in “understanding” is in uncompressing (or uncovering) the missing text – while for us humans that was a genius invention for effective communication, language understanding by machines is difficult because machines do not know what we all know. But the MTP phenomenon is precisely why data-driven and machine learning approaches, while might be useful in some NLP tasks, are not even relevant to NLU:

The equivalence between (machine) learnability (ML) and compressibility (COMP) has been mathematically established. That is, it has been established that learnability from a data set can only happen if the data is highly compressible (i.e., it has lots of redundancies) and vice versa (see this article and the important article “Learnability can be Undecidable” that appeared in 2019 in the journal Nature). While the proof between compressibility and learnability is quite technically involved, intuitively it is easy to see why: learning is about digesting massive amounts of data and finding a function in multi-dimensional space that ‘covers’ the entire data set (as well as unseen data that has the same pattern/distribution). Thus, learnability happens when all the data points can be compressed into a single manifold. But MTP tells us that NLU is about uncompressing.


Incorrectly defined cost in, incorrectly defined out (ICI2CO)

A group of Israeli researchers recently uploaded a preprint trying to quantify the “cost” of implementing a national lockdown instead of trace-and-test, and boy did they drop the ball.

The study rests on its definition of “cost”, and the definition used in the paper is extremely questionable. The authors equate the economic costs with the estimated decline in GDP. GDP has a handful of now widely discussed methodological weaknesses, such as that polluting production is as much a part of GDP as the elimination of that pollution, or that the production of advertising has a positive impact on GDP. These stem from the fact that everything that is sold and bought is valued (with the well-known side effect that socially useful services such as housework and volunteerism are not included).

But this cost accounting then becomes blurred because the authors look at the “costs” of hospital care without making a clear distinction between the two. If the latter costs are expenditures of private persons, they could have a negative effect on their financial situation, which can be reduced or even completely neutralized by the state with support payments. If the authors refer to costs for drugs, equipment and personnel that are paid directly by the state, there are no negative financial effects for private individuals at all (at least as long as the state does not pay for these expenses in foreign currency).

In both cases, however, hospital costs are NOT costs in the sense of GDP reductions – on the contrary, any payment for a drug, a ventilator, or a nurse’s salary is positively reflected in GDP. (That the chronically ill are good for GDP is another absurdity of GDP calculation.)

The fact that the total “costs” are so ambiguously defined also means that the per capita costs as a figure are rather worthless and not suitable as a basis for debate.

A much more interesting approach would be “quality of life” considerations, which the authors very briefly touch on: a complete lockdown without unconditional salary compensation for part-time unemployed people has serious consequences for their quality of life; the same applies to small (micro) enterprises, especially those that depend on daily customer traffic. These companies need non-repayable support to ensure their continued existence and minimize the financial damage to their owners.

Psychologically, lockdown experiences in various countries have shown that depression increased, as did anxiety, the perception of being lonely and resulting drug abuse of all kinds, especially among people who were already socially isolated before.

In other words, there are many important aspects to the question of “national lockdown vs. test-trace-isolate”, and especially to the question of *how* isolation and prevention of certain economic activities should be designed, that could help decision makers, but a headline-grabbing “45M US$ per life saved” is not one of them.

I might just not like press releases (or press write-ups)

My LinkedIn feed presented me with a write-up of work done at the US Department of Energy’s Pacific Northwest National Laboratory by a website called Verdic, titled A deep neural network is being harnessed to analyse nuclear events:

[T]he data is shrouded in external noise, which can hinder the discovery of more uncommon signals. Even a light switch being turned on in a building can produce noise and subsequently affect the data.

The write-up is not actually very informative: it gives no information about how “deep” the network is, but since it is

running on a standard desktop computer

I suspect not too deep. Although it is of course entirely possible that it “runs on a desktop computer” for the (much cheaper) classification task but needs to be trained on something more powerful.

This doesn’t stop the write-up from proclaiming that

Deep learning is likely to become the AI technology that allows cognitive systems to surpass human intelligence for specific applications.

And I have to admit that I really don’t understand the described training procedure:

A sample of 32,000 pulses was used to adapt the network, programming it to learn the changing features the pulses exhibited that would be critical when interpreting the data. Jesse Ward then sent over thousands of additional pulses so that the network could begin to deduce what signals were good and which were bad; as time progressed, the more complex the pulses became.

What exactly is the difference between the first 32k pulses and how they were used, and the ones afterwards? It sounds to me a bit as if the first step is a feature construction one – maybe using an auto-encoder network – and the second one of discriminative learning. But yeah, far from clear.

There’s unfortunately no link to any scientific publication in the write-up, so I’ll add one from 2015: Bockermann et al. Online Analysis of High-Volume Data Streams in Astroparticle Physics (pdf), which describes the problem it tackles as follows:

A central problem in all these experiments is the distinction of the crucial gamma events from the background noise that is produced by hadronic rays and is inevitably recorded. This task is widely known as the gamma-hadron separation problem and is an essential step in the analysis chain.

i.e. something very similar to the problem described above. They did it by introducing

the fact-tools – our high-level framework to model the data flow, which integrates state of the art tools such as WEKA and MOA to incorporate machine learning for various tasks.

Enjoy the read!

“Fake” news can indeed fool this new algorithm, “fake” news are in the eye of the beholder, and why all of this is a problem

“Fake” news detection has been a big topic ever since the 2016 US presidential elections and the Brexit vote, and the claims of the respective losing sides that people had been tricked by “fake” news into voting for the winner.

Now the University of California, Riverside has put out a rather strongly worded press release, titled Fake News Can’t Fool New Algorithm. I am not going to comment on the method but I have serious misgivings about the evaluation – misgivings that are by no means limited to this paper (pdf) but that apply to the entire “fake” news detection setting.

My first problem is with this statement from the press release:

The team members put three sets of articles— two public datasets and their own collection of 63,000 news articles— through their algorithm and found that it accurately sorted articles into fake news categories 75 percent of the time.

I know that this is a press release and that those have a problem with representing research correctly but still:

  1. The data set that the authors created, and on which they perform most of their experiments, is highly imbalanced: 31,739 “fake” news and 409,076 “real” news articles. To get around this, they down-sample the majority class, which can be defended when it comes to training data (since one could know the articles’ labels at the time of model building) but not for test data (when labels are unknown)1.
  2. Given this ideal setting, where both categories are balanced, they then achieved a precision of 73%, i.e. 27% of news classified as “fake” were actually “real” because they looked too similar – that’s a big problem because it risks censoring a lot of legit information that looks dissimilar to the mainstream.
  3. Furthermore, recall was only at 74%, i.e. 26% of “fake” news escaped detection.
  4. So an arguably better way of summarizing the performance of the method is: Under ideal conditions, the method gets more than one in four stories wrong. Or, in other words, “fake” news can indeed fool this new algorithm.
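To make the point about down-sampled test data concrete, here is a rough back-of-the-envelope sketch. It takes the class sizes and the balanced-setting precision and recall quoted above, derives the false-positive rate they imply, and then asks what precision would look like at the original class ratio. The big assumption (mine, not the paper's) is that recall and false-positive rate stay exactly the same when the test set is not down-sampled; the function names are made up for this illustration.

```python
# Figures quoted above: class sizes of the authors' own collection and the
# precision/recall reported for the balanced (down-sampled) evaluation.
N_FAKE = 31_739
N_REAL = 409_076
PRECISION_BALANCED = 0.73
RECALL = 0.74


def false_positive_rate_from_balanced(precision: float, recall: float) -> float:
    """False-positive rate implied by precision and recall when both
    classes have the same number of test articles."""
    # With n articles per class: TP = recall * n and
    # precision = TP / (TP + FP), so FP / n = recall * (1 - precision) / precision.
    return recall * (1.0 - precision) / precision


def precision_at_class_sizes(n_fake: int, n_real: int, recall: float, fpr: float) -> float:
    """Precision for the "fake" class at the given class sizes.

    ASSUMPTION: recall and false-positive rate are unchanged from the
    balanced evaluation.
    """
    true_positives = recall * n_fake
    false_positives = fpr * n_real
    return true_positives / (true_positives + false_positives)


fpr = false_positive_rate_from_balanced(PRECISION_BALANCED, RECALL)
balanced = precision_at_class_sizes(1_000, 1_000, RECALL, fpr)
imbalanced = precision_at_class_sizes(N_FAKE, N_REAL, RECALL, fpr)

print(f"implied false-positive rate:     {fpr:.3f}")
print(f"precision, balanced test set:    {balanced:.3f}")
print(f"precision, original class ratio: {imbalanced:.3f}")
```

Under that assumption, precision at the original class ratio comes out at roughly 17% instead of 73%, simply because there are about thirteen times as many “real” articles available to be misclassified.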


The conceptual underpinnings of machine intelligence

I’ve just discovered a series of very interesting posts by Peter Sweeney on Medium, in which he interrogates the conceptual and arguably philosophical underpinnings of machine intelligence (or a bit more narrowly, the current research in machine learning).

Especially the last post got me thinking quite a bit: while he juxtaposes ML predictions and the generated “knowledge” with the scientific approach to knowledge generation, I’m a bit surprised that he doesn’t mention active learning in this discussion, which to me seems to be relatively close to the Popperian scientific approach: have a hypothesis, test an example for which the result is not clear (or that you expect to violate the hypothesis), adjust the hypothesis if necessary.
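To illustrate what I mean by that loop, here is a deliberately tiny sketch of uncertainty sampling on a toy problem of my own choosing: learning a 1-D threshold. The oracle, its hidden threshold of 0.62, and all function names are hypothetical and only there to make the example runnable; real active learning setups are of course messier.

```python
def oracle(x: float) -> int:
    """Hypothetical ground truth: label is 1 at or above a hidden threshold."""
    return int(x >= 0.62)


def active_learn_threshold(pool, oracle, n_queries):
    """Uncertainty sampling for a 1-D threshold classifier.

    The hypothesis is an interval (lo, hi] that must contain the threshold.
    Each round queries the unlabeled point closest to the midpoint, i.e. the
    example the current hypothesis is least sure about, and then tightens
    the interval according to the answer.
    """
    lo, hi = 0.0, 1.0
    unlabeled = sorted(pool)
    queried = []
    for _ in range(n_queries):
        candidates = [x for x in unlabeled if lo < x < hi]
        if not candidates:
            break  # every remaining point is already explained by the hypothesis
        midpoint = (lo + hi) / 2
        x = min(candidates, key=lambda c: abs(c - midpoint))
        label = oracle(x)  # the "experiment"
        queried.append((x, label))
        unlabeled.remove(x)
        if label == 1:
            hi = x  # threshold lies at or below x
        else:
            lo = x  # threshold lies above x
    return lo, hi, queried


pool = [i / 100 for i in range(101)]
lo, hi, queried = active_learn_threshold(pool, oracle, n_queries=10)
print(f"threshold is in ({lo:.2f}, {hi:.2f}] after {len(queried)} queries")
```

The learner only ever asks about examples its current hypothesis cannot decide, which is the part that reminds me of testing a conjecture where it is most likely to fail.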

And I wouldn’t be me if I didn’t think that the goal of knowledge generation could be helped by a) using learned models to generate data (if possible) and sanity-checking them, and b) generating artificial data that should give certain results and seeing what the model/approach makes of them.

In addition: I don’t work on goal-oriented ML (predictive learning or reinforcement learning) very much, even though this blog might lead one to believe otherwise, but instead on unsupervised data mining.
We like to flatter ourselves that the results of our techniques are hypothesis-generating in that they basically just point out: “this relationship is unexpected” or “there is indeed structure in the data that had not been defined before” and leave it to (in fact, require from) humans to interpret and derive the knowledge. As a precondition for this to work, our results have to be symbolic (as in pattern mining) or at least more-or-less interpretable (as in cluster memberships of data instances).
So I wonder where this would enter into his thoughts about explanations and creativity.
The other thing is bisociation: there was a (honestly largely still-born) EU FP7 project a couple of years ago, the stated purpose of which was developing methods for mapping vocabulary/concepts in different research domains to each other and performing pattern mining over this space.
Today’s research on heterogeneous networks (pdf) goes in a similar direction but requires already predefined connections between concepts/entities/data sources so any results are arguably not creative leaps.

Problems with Machine Learning: we’ve been here before

A friend of mine shared a blog post about data privacy issues in machine learning.

While it seems that the paper they talk about is pretty neat, this is still an immensely frustrating post to me. Membership inference seems to me to be similar to k-anonymity, a problem that was extensively studied at least 10 years ago. Yet that term doesn’t even appear in their paper on membership inference. And adversarial learning has been a concern for machine learning researchers since long before the Deep Learning hype (and, yes, also before 2011).

A few days ago, that same friend had linked to an ars technica article about how Amazon’s face recognition matches lawmakers who are people of color to mugshots of people of color. This, again, is a problem that was explored as long as ten years ago.

What’s truly remarkable to me is that such well-funded organizations as Google and Amazon have apparently ignored much of that work to go ahead and redo old mistakes. It feels a bit as if the advent of Deep Learning has been used to wipe the slate clean and rediscover a lot of known insights, retarding development.

IJCAI bidding is upon me

and as last year, it promises to be annoying.

  1. The two highest-ranked papers for me were on Deep Learning. The ranking is supposed to help us bid and is supposedly based on a combination of keywords selected by PC members and analysis of papers that they upload.

    I have no knowledge of Deep Learning to speak of, I have never written a Deep Learning paper, and I didn’t select “Deep Learning” as a keyword!

  2. After having done my bids on the first 25 papers (which took time because the titles are not super-informative, so I read the abstracts), I wanted to move on to page 2, only to find out that the system had logged me out, losing all but three of my bids in the process.

But at least I’ve encountered a strong contender for the most buzz-wordy title of the year: Improved Kernel Density Estimation Self-organizing Incremental Neural Network to Perform Big Data Analysis!

I’m a (we’re) gatekeeper(s)

I’ve just finished reviewing and discussing for SDM 2018. It’s a good conference, and they give a smaller reviewing load than the dysfunctional ICDM, for instance. I had eight papers to review and it turned out that they were all rejects, either because the ideas were a bit half-baked, or (in the majority of cases) because they were unaware of important related work and therefore didn’t discuss/compare.1

SDM doesn’t blind the reviewers to each other, and I noticed that for the seven papers where I could see others’ reviews, I knew between one and three (out of three) of my coreviewers personally. In some cases, I knew the meta-reviewer as well. Now, I feel that our reviews and decisions were justified2 but if we (as a group of researchers knowing each other) simply didn’t like the direction of the work, for instance, we would have been in the position to block it.

In a sense, it’s unavoidable that a situation like this occurs in peer review, but it becomes more likely if the (sub)field is somewhat specialized and there’s only a certain number of researchers working in it on a high-enough level to be invited as reviewers.

1 This is a side-effect of the publish-or-perish mechanisms: we publish way too much in our field, which makes it often very hard to know all the relevant related work in the first place – especially when one is a PhD student. But letting such papers get published would only worsen the problem.

2 Although this introduces a chicken-egg problem: one of the reasons that I trust the others’ reviews is because I know and respect them and their knowledge.

Great accuracy + forgetting to bet = slight losses

| Week | Naive Bayes (Avg+OAvg) | Naive Bayes (Avg) | ANN (Avg + OAvg) | ANN (Avg) | Neural Network (Adj) |
|---|---|---|---|---|---|
| Week 13 | 12/16 | 12/16 | 12/16 | 9/16 | 13/16 |
| Through week 13 | 112/175 | 115/175 | 107/175 | 100/175 | 105/175 |

Look at those accuracies! 75% for the classifier I use to bet, as well as for two others, and even 81.25% for the ANN with adjusted statistics. Yet I still lost ~50 euros, but this time this is mainly due to me: I forgot to bet on both the (correctly predicted) Seattle-over-Philadelphia upset and the MNF match (also correctly predicted). The biggest payout was Minnesota-over-Atlanta, btw, a match that the latter two ANN classifiers got wrong.
I never forgot to bet last year, probably because there was actually a chance of winning – this year, I am just trying to claw some money back, and feel stymied at every turn. 🙂
Finally, purely in accuracy terms, past trends show up again – the Naive Bayes classifiers stand head-and-shoulders above the rest – 64%/65.71% vs 61.14%/57.14%/60%.
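For anyone who wants to check the percentages, here is a small sketch that recomputes them from the tallies in the table above. The dictionary layout and helper name are just my choice for this illustration.

```python
# Correct picks copied from the table above.
WEEK_13_GAMES = 16
SEASON_GAMES = 175

week_13_correct = {
    "Naive Bayes (Avg+OAvg)": 12,
    "Naive Bayes (Avg)": 12,
    "ANN (Avg + OAvg)": 12,
    "ANN (Avg)": 9,
    "Neural Network (Adj)": 13,
}
season_correct = {
    "Naive Bayes (Avg+OAvg)": 112,
    "Naive Bayes (Avg)": 115,
    "ANN (Avg + OAvg)": 107,
    "ANN (Avg)": 100,
    "Neural Network (Adj)": 105,
}


def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100.0 * correct / total, 2)


week_13_accuracy = {m: accuracy_pct(c, WEEK_13_GAMES) for m, c in week_13_correct.items()}
season_accuracy = {m: accuracy_pct(c, SEASON_GAMES) for m, c in season_correct.items()}

for model in season_correct:
    print(f"{model:24s} week 13: {week_13_accuracy[model]:6.2f}%   season: {season_accuracy[model]:6.2f}%")
```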

The end’s getting closer

| Week | Naive Bayes (Avg+OAvg) | Naive Bayes (Avg) | ANN (Avg + OAvg) | ANN (Avg) | Neural Network (Adj) |
|---|---|---|---|---|---|
| Week 11 | 10/14 | 10/14 | 10/14 | 11/14 | 8/14 |
| Week 12 | 13/16 | 13/16 | 9/16 | 12/16 | 13/16 |
| Through week 12 | 100/159 | 103/159 | 95/159 | 91/159 | 92/159 |

Will you look at those accuracies for the NB models: 71% and 81.25%! And the latter won me exactly 19 euros! 😦 There were three matches I couldn’t bet on because the favorite’s odds were too low, I missed two upsets, and incorrectly predicted another one.

Apart from that, I tallied up theoretical winnings through week 11 last week, i.e. if I’d bet US$ 100 per match, for four models:

| Metric | Naive Bayes (Avg+OAvg) | Naive Bayes (Avg) | ANN (Avg + OAvg) | ANN (Avg) |
|---|---|---|---|---|
| Accuracy through week 11 (%) | 60.84 | 62.94 | 60.14 | 55.24 |
| Winnings through week 11 (US$) | 28.65 | 203.02 | 1018.23 | -1484.21 |
| Underdogs correct through week 11 | 13 | 12 | 22 | 15 |

I am not surprised by this; it’s absolutely in line with what I observed in the past two seasons. But damn, it hurts, and I still have no idea how to decide which model to pick at the beginning of the season.
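Why accuracy and winnings can drift so far apart is easiest to see with a toy calculation. The sketch below assumes flat US$ 100 stakes and decimal odds (my assumption about the format), and every odds value in it is INVENTED for illustration; none of it comes from real matches or from my actual bets.

```python
STAKE = 100.0  # US$ per match, as in the theoretical tally above


def flat_stake_profit(bets, stake=STAKE):
    """Net profit for a list of (decimal_odds, won) bets at a fixed stake.

    A winning bet pays stake * (odds - 1); a losing bet costs the stake.
    """
    return sum(stake * (odds - 1.0) if won else -stake for odds, won in bets)


def accuracy(bets):
    """Fraction of bets that were won."""
    return sum(1 for _, won in bets if won) / len(bets)


# INVENTED odds, for illustration only.
# Model A backs ten favorites at 1.25 and gets eight right.
favorites_only = [(1.25, True)] * 8 + [(1.25, False)] * 2
# Model B backs seven favorites at 1.25 (five right) and three underdogs
# at 3.00 (two right).
with_underdogs = (
    [(1.25, True)] * 5
    + [(1.25, False)] * 2
    + [(3.00, True)] * 2
    + [(3.00, False)] * 1
)

for name, bets in [("favorites only", favorites_only), ("with underdogs", with_underdogs)]:
    print(f"{name:15s} accuracy {accuracy(bets):.0%}, profit {flat_stake_profit(bets):+.2f} US$")
```

With these made-up numbers the more accurate model merely breaks even, while the less accurate one that hits a couple of underdogs comes out ahead, which is at least consistent with the pattern in the table.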