Why NLP is Looking in the Wrong Direction

Jul 30, 2020

This article is an extension to Atomos. If you want to know more about Atomos as a concept-oriented language see parts I and II. Atomos gets its name here.

Language is, in many ways, a human-designed technology. It was developed by us for recording and transmitting ideas and thoughts. What I am curious about is how effective this technology is at doing what it was designed to do. For any technology to be worth its silicon, it needs to be intuitive, consistent, and perhaps measurable if we want to make periodic updates to sustain it as a product.

Measuring the effectiveness of the English language for concept transport is tricky and you will see why soon. However, a few assumptions can go a long way to get a better view of the problem. Let’s start with a rough calculation. Our goal is to see how well equipped a complete sentence might be for holding an idea.

Comparing Information Carrying Capacity

To do this, we will compare how much information a typical English sentence can carry against how much a concept contains, which is the standard the sentence has to meet. This should shed some light on how well a typical sentence performs at its concept-carrying duties.

We will assume the following:

Assumption #1 — A complete sentence is generally a complete thought and is equal to a concept.
Assumption #2 — On average, there are 15–20 words per sentence
Assumption #3 — On average, there are 1.66 syllables per word
Assumption #4 — On average, there are 4.79 letters per word

To find the “fitness of an English sentence” we are going to compare dimensionality between a sentence and a concept. In other words, if your concept has 3 dimensions or units of information and your chosen sentence does as well, then you can convey your concept exactly.

A theoretical model, like Newton’s universal law of gravitation, will contain the same dimensions on both sides of the equation. An empirical model, like every regression equation ever, will not be dimensionally consistent. We want to make an educated guess at how many dimensions are on each side of the equation and hope the two numbers are within arm’s reach of each other. To be clear, dimensions will be our predictor variables.

If the sentence oversummarizes the concept, no amount of data will ever give us a good set of predictor variables. An exact match would be a solid contender for deriving a theoretical equation (don’t get your hopes up), and if it is close we have a shot at a decently correlated regression equation. Case 4 would probably interest educators.

Step 1 — How many predictor variables/dimensions are available in a typical English sentence?

Since a typical sentence contains between 15 and 20 words, we will just calculate an upper and a lower bound for the dimensionality of a sentence. For the least conservative case (the upper bound), we will assume that every word is free to vary and that every letter of every word is free to vary as well.

Unfortunately, there isn’t a complete or coherent list of English grammar rules. Some rules may act as independent variables while others act as constraints. To account for some of these rules we will use syllable count. Realistically, every letter won’t vary independently; letters seem to be largely controlled by syllables. So for the lower bound we will count syllables, rather than words or letters, as the significant variables.
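The bounds can be worked out directly from Assumptions #2 to #4. The sketch below is a reconstruction, not necessarily the exact calculation originally shown: it assumes the lower bound is the shortest typical sentence counted in syllables and the upper bound is the longest typical sentence counted in letters. The function name `sentence_dimension_bounds` is made up for illustration.

```python
# Averages taken from Assumptions #2-#4 in the text.
WORDS_PER_SENTENCE = (15, 20)   # shortest and longest typical sentence
SYLLABLES_PER_WORD = 1.66
LETTERS_PER_WORD = 4.79


def sentence_dimension_bounds(words=WORDS_PER_SENTENCE,
                              syllables_per_word=SYLLABLES_PER_WORD,
                              letters_per_word=LETTERS_PER_WORD):
    """Return rounded (lower, upper) estimates of free variables per sentence.

    Assumption: lower bound = fewest words, one variable per syllable;
    upper bound = most words, one variable per letter.
    """
    shortest, longest = words
    lower = shortest * syllables_per_word   # 15 * 1.66
    upper = longest * letters_per_word      # 20 * 4.79
    return round(lower), round(upper)


lower, upper = sentence_dimension_bounds()
print(f"between {lower} and {upper} predictor variables")
```

Under these assumptions the result is 25 and 96, which agrees with the range quoted in the text.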

From the above calculation, we can see that for an average English sentence there are between 25 and 90+ predictor variables available. That sounds like quite a bit, right? Let’s see what the other side looks like.

Step 1.info — What do dimensions, predictor variables, and degrees of freedom have in common?

Slight detour! Skip ahead if you are comfortable with predictor variables as dimensions and degrees of freedom (dof). You may have noticed that I am using the words dimension, predictor variable, and soon degrees of freedom (dof) interchangeably. This is because they are the same mathematically. I hope that the figure below helps make this more intuitive.

Looking at the plane below you can see that there are 3 axes and a rotation around each, for a total of 6 degrees of freedom. In the simplest sense, a degree of freedom (dof) is an axis of movement independent from any other axis (an independent variable).

Physical dimensions are easier to visualize than statistical ones, but they are the same mathematically. If you had a dataset with 250 predictor variables, then your data would have 250 dimensions, or 250 dof (assuming they were truly independent). Most predictor variables have some correlation with each other, which is called intercorrelation (or multicollinearity), so dimensionality reduction is usually warranted.
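To make the "truly independent" caveat concrete, here is a small sketch using a made-up toy dataset. It takes the extreme case where one predictor is an exact combination of two others, so three columns carry only two dimensions. Real predictors are usually only partly correlated, so this is an illustration of the idea rather than a dimensionality reduction method. The helper `matrix_rank` is written here for the example and uses plain Gaussian elimination.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Count independent dimensions via Gaussian elimination (exact fractions)."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # this column adds no new dimension
        m[rank], m[pivot] = m[pivot], m[rank]
        # Clear this column in every row below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical data: rows are observations, columns are predictor variables.
# The third column always equals column 1 + 2 * column 2.
dependent = [
    [1, 0, 1],
    [0, 1, 2],
    [2, 3, 8],
    [4, 1, 6],
]
# Here no column can be built from the others.
independent = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
]

print(matrix_rank(dependent))    # three predictors, fewer real dimensions
print(matrix_rank(independent))  # three predictors, three dimensions
```

The first dataset has three predictor columns but only two degrees of freedom; the second has three of each.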

If you had 100 truly independent dimensions it would be much harder to visualize. It would be like having 100 lines going through the origin and somehow all being perpendicular to each other to make some strange hyperdimensional blob. No worries though, a calculator has no problem with it.
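The "no problem for a calculator" point is easy to check. This minimal sketch builds 100 axis directions and confirms that every pair is perpendicular, using the fact that two lines through the origin are perpendicular when the dot product of their direction vectors is zero. The names `unit_axes` and `dot` are illustrative helpers.

```python
from itertools import combinations


def unit_axes(n):
    """Return n vectors of length n, each pointing along exactly one axis."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


axes = unit_axes(100)

# Check all 4,950 distinct pairs of axes for perpendicularity.
all_perpendicular = all(dot(u, v) == 0 for u, v in combinations(axes, 2))
print(len(axes), all_perpendicular)
```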

Step 2 — How many dimensions might a concept have?

Again, we are comparing the dimensionality of a sentence and a concept. There are any number of ways to do this, but an idealized workflow might start with a concept. From that, our minds automagically create a sentence using words we think fit and grammar we only understand intuitively.

Concepts vary widely so let’s choose something simple and make this problem more concrete. How about my personal love for coffee?

If you haven’t noticed, I have been using Atomos’ notation to make sense of this problem, although this exercise is simple enough for a general mind map to take care of.

Above, the sentence “I love coffee” is our example. If you picked a note off a desk and read this, would there be any way to split apart the letters, all three words, and the few syllables to eventually guess that I like a light roast coffee because it contains the most caffeine for my brew style? It would never happen. Too much information is completely lost. A computer could only ever hope to ballpark it if it had a working knowledge of the context of the situation.

For the contextual information missing, how many dimensions might that amount to? Well, how many dimensions describe a person, their personality, or what love means to them? Equally, how many dimensions does it take to describe all the varieties of coffee and brewing styles? I don’t know myself but if we counted I wouldn’t be surprised if it scored in the thousands.

Step 3 — Pulling it together and other problems

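So where did we end up? For our example, it looks like oversummarization: “I love coffee” offers only 3 words, 4 syllables, and 11 letters, while the concept behind it may have thousands of dimensions.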

Fortunately, this is just one case. A much larger sample would give us a better idea, right? No, not really. If we revisit our assumptions it will become clearer. We missed a couple of considerations that make this problem much more difficult.

For example, have you ever wondered why spelling and grammar checkers aren’t better? Well, they are pretty good at spelling and grammar, but that isn’t what they struggle with. Spelling and grammar checkers fail on meaning (semantics). In the example, how much information is in the sentence “I love coffee”? What would be different if you had never read the note? The sentence sounds okay but conveys almost no useful information. It’s like your friend telling you the title of a book you should read and then nothing else. Unfortunately, most of our 25–96 predictor variables are quite meaningless (literally). They tell us nothing about the concept being carried by the sentence.

Grammar as a predictor variable is not good at predicting

Looking above we can zoom in on part of the problem. Conceptually, or semantically, a cup, a glass, and a mug are very similar. However, lexicographically they are about as dissimilar as any other words you could randomly pick out of the dictionary. This makes our favorite language essentially patternless. Without patterns, rote memorization is required to learn it. Now I see why we take 12+ years of English classes.
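One rough way to put a number on "lexicographically dissimilar" is edit distance: the fewest single-letter insertions, deletions, or substitutions needed to turn one spelling into another. Treating spelling similarity as edit distance is an assumption made for this sketch, not a measure the article defines, and the word "cap" is added only as an unrelated comparison point.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def spelling_similarity(a, b):
    """1.0 for identical spellings, 0.0 when no letters line up at all."""
    longest = max(len(a), len(b)) or 1
    return 1 - edit_distance(a, b) / longest


# Three words with nearly the same meaning, plus one unrelated word.
pairs = [("cup", "mug"), ("cup", "glass"), ("mug", "glass"), ("cup", "cap")]
for first, second in pairs:
    print(first, second, round(spelling_similarity(first, second), 2))
```

By this measure "cup" is closer in spelling to "cap" than to either "mug" or "glass", even though the meanings run the other way.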

It seems we have gotten impressively good at reading between the lines, but not good enough to notice that the information we are looking for simply isn’t there. Our English API may be out of date, but know that I’m not suggesting we just toss it out the window. We just need to start seeing the problem more clearly so we can address it. It isn’t one we want to let get worse. Think about where we’d all be today if our dial-up modems had never gotten better than 56 kbps. Would the internet be a thing?

If NLP is looking in the wrong direction then where is the right direction?

We have a pretty good understanding of the dimensionality of our language but a surprisingly narrow understanding of the concepts that are used to generate it. A good place to start would be the other side of this equation.

Summary

Speaking of the internet: its invention and subsequent connectivity have mercilessly thrown us all together into a giant frying pan. All our ideas, dreams, beliefs, feelings, and thoughts are there at the boiling point on the Earth’s surface. And, no, I am not talking about climate change this round. I am talking about the imbalance between global idea creation and idea naming and organization. This imbalance perpetuates inefficient communication. This metaphor is interesting to me because it looks strikingly similar to a tale I heard in science class as a kid. The story talked about steamy oceans on a newly formed Earth bubbling with all the chemical ingredients needed for non-life to chemically evolve into life, people, and eventually frozen yogurt stores.

After a few hundred million years of doing its best, nature devised the beginnings of a highly efficient and adaptable language that we call DNA. While our frying pan is cooking a different dish, I wonder if the aim is suspiciously similar. Is the cold universe suggesting we self-organize and get busy on creating and organizing a new language, or is it Mother Nature telling us to clean our room?