Natural Language Processing


Algorithms, Biotechnology, Programming Language


Natural language processing, or NLP, is a branch of artificial intelligence that deals with analyzing, understanding, and generating natural human languages so that computers can process written and spoken human language without requiring users to learn a specialized computer language. Natural language processing, sometimes also called “computational linguistics,” uses both semantics and syntax to help computers understand how humans talk or write and how to derive meaning from what they say. The field combines artificial intelligence with computer programming so effectively that programs can even translate one language into another reasonably accurately. It also includes voice recognition: the ability of a computer to understand what you say well enough to respond appropriately.


It has long been a dream of scientists, inventors, and computer programmers to make a robot, computer, or program, such as a voice response program, that can be mistaken for a human. Alan Turing once said, “A computer would deserve to be called intelligent if it could deceive a human into believing it was human.” One of the roadblocks to creating a machine like this is that human language has been nearly impossible for machines to understand and respond to appropriately.

That hasn't stopped people from trying. Many early science fiction stories are based on the idea of a robot that passes as a human. In the early 1950s, programmers attempted to get computers to understand language well enough to translate from one language to another, with limited success. An undocumented (and probably apocryphal) story about these early attempts goes like this: A programmer typed “The spirit is willing but the flesh is weak” into a computer program that was supposed to translate the sentence into Russian, which it did. Then the programmer asked the computer to translate the Russian sentence back into English. The result was “The vodka is good, but the meat is rotten.” This makes some sense if you read the original sentence very literally, but the meaning of the sentence was completely lost.


Humans learn and use language in a way that is difficult for computers to replicate. As a simple example, here is a sentence that might mean different things: “Baby swallows fly.” Is “baby” a noun or an adjective? Is “swallows” a verb or a noun? Is “fly” a noun or a verb? Depending on the context of the conversation, a human is likely to understand this ambiguous sentence. A computer, however, lacking humanlike contextual understanding, is unlikely to resolve the ambiguity.

Computer programmers have made great strides in this field. They have combined the linguistic fields of semantics and syntax with powerful computer programs using neural networks that “learn” to look for the same kinds of signals humans use to derive meaning. What words surround the words we want to understand? In the example above, if “birds” is mentioned anywhere near the sentence, the computer program can “understand” that the sentence is talking about small birds that fly. If “insect” or “child” is mentioned nearby, the computer can “understand” that the sentence means a small child ate an insect. This is a very simple example that doesn't really convey the scope of the power behind natural language processing programs, but it is easy to understand.
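The context-word idea can be sketched as a toy program. This is only an illustration of the principle, not a real neural model: the cue-word lists below are invented for the example rather than learned from a corpus.

```python
import re

# Invented cue words for the two readings of "Baby swallows fly".
BIRD_CUES = {"bird", "birds", "nest", "wings", "feathers"}
CHILD_CUES = {"child", "infant", "toddler", "insect", "bug"}

def disambiguate(context: str) -> str:
    """Pick a reading of 'Baby swallows fly' from nearby context words."""
    words = set(re.findall(r"[a-z]+", context.lower()))
    bird_score = len(words & BIRD_CUES)
    child_score = len(words & CHILD_CUES)
    if bird_score > child_score:
        return "young birds are flying"
    if child_score > bird_score:
        return "a child swallowed an insect"
    return "ambiguous"

print(disambiguate("We watched the birds by the nest. Baby swallows fly."))
# prints "young birds are flying"
```

A real system replaces the hand-written cue lists with statistical associations learned from millions of sentences, but the underlying question is the same: which words appear near the ambiguous one?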


Programmers use a variety of techniques to help machines understand natural language. For example, automatic summarization relies on one of two techniques: extraction or abstraction. Extraction attempts to pull the most important segments out of the text and assemble them into a summary. Abstraction, which is much more complex, involves generating a new summary of the information in the program's own words.
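The extraction technique can be sketched in a few lines. One common, simple approach (assumed here; real summarizers are more sophisticated) is to rank each sentence by how frequent its words are in the whole text and keep the top-ranked sentences.

```python
import re
from collections import Counter

def extractive_summary(text: str, n: int = 1) -> list[str]:
    """Return the n sentences whose words are most frequent overall."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    def score(sentence: str) -> int:
        return sum(freq[w] for w in re.findall(r"[a-z]+", sentence.lower()))
    top = sorted(sentences, key=score, reverse=True)[:n]
    return [s for s in sentences if s in top]  # preserve original order

print(extractive_summary("Dogs bark. Dogs bark loudly at night. Cats sleep."))
# prints ['Dogs bark loudly at night.']
```

Note that this only selects existing sentences; abstraction, by contrast, would have to compose a new sentence, which is why it is so much harder.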

Sentiment analysis tries to identify the emotions conveyed in a text. For example, on a trip review site, a program would try to identify whether a review was positive or negative based on the words used in the review, such as “liked,” “enjoyed,” “unhappy,” or “problem.”
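A minimal sketch of this word-based approach counts positive and negative words from a lexicon. The word lists below are illustrative stand-ins for a real sentiment lexicon, which would contain thousands of scored entries.

```python
import re

# Illustrative mini-lexicon; real lexicons are far larger.
POSITIVE = {"liked", "enjoyed", "great", "wonderful", "clean"}
NEGATIVE = {"unhappy", "problem", "dirty", "terrible", "disappointed"}

def sentiment(review: str) -> str:
    """Label a review by counting positive vs. negative lexicon words."""
    words = re.findall(r"[a-z]+", review.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("We enjoyed the tour and liked the hotel."))  # prints "positive"
```

Simple word counting like this stumbles on negation (“not great”) and sarcasm, which is why production systems use trained models rather than raw lexicon lookups.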

Text classification is a way to assign predefined categories to a text. For example, your email spam detector determines, with varying degrees of success, whether a message is spam or something you actually want to see. Other text classification techniques organize news stories by topic, such as sports or headlines. Some text classification programs, used for author attribution, are so sophisticated that they can determine who wrote a text based on the style of writing, word frequency, vocabulary richness, phrase structure, and sentence length. Programs like these have been used to investigate, for example, whether Shakespeare really wrote all of the plays attributed to him.
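The spam-detection example can be sketched with naive Bayes, a classic text classification technique: the program learns which words are frequent in each category and picks the most probable category for a new message. The training messages and labels below are invented for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Bag-of-words naive Bayes classifier with Laplace smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.class_counts = Counter()            # label -> document count
        self.vocab = set()

    def train(self, text: str, label: str) -> None:
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.class_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text: str) -> str:
        best_label, best_logprob = None, float("-inf")
        total_docs = sum(self.class_counts.values())
        for label in self.class_counts:
            logprob = math.log(self.class_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in text.lower().split():
                # Laplace smoothing keeps unseen words from zeroing the score
                p = (self.word_counts[label][w] + 1) / (total_words + len(self.vocab))
                logprob += math.log(p)
            if logprob > best_logprob:
                best_label, best_logprob = label, logprob
        return best_label

nb = NaiveBayes()
nb.train("win money now free prize", "spam")
nb.train("free cash win big lottery", "spam")
nb.train("meeting agenda for monday", "ham")
nb.train("lunch with the project team", "ham")
print(nb.classify("win a free prize"))  # prints "spam"
```

Author attribution works on the same principle, except that the “categories” are candidate authors and the features include sentence length and vocabulary richness rather than just word choice.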

Conversational agents are systems that attempt to hold a conversation with a human. You may have seen them in customer service situations, where they are often called “chatbots.” You may not recognize at first that you are not talking to a human, but the program often gives itself away: eventually an unpredictable response in the dialogue reveals that you are not chatting with a person.
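The simplest chatbots work by pattern matching, a technique that dates back to the 1960s program ELIZA. The sketch below shows the idea for a customer service setting; the patterns and canned replies are invented for illustration, and the fallback reply is exactly the kind of response that gives the program away.

```python
import re

# (pattern, response) pairs; the first matching pattern wins.
RULES = [
    (r"\b(hello|hi)\b", "Hello! How can I help you today?"),
    (r"order", "I can look up your order. What is the order number?"),
    (r"refund", "I'm sorry to hear that. Let me start a refund request."),
]
FALLBACK = "I'm not sure I understand. Could you rephrase that?"

def reply(message: str) -> str:
    """Answer with the canned response for the first matching pattern."""
    for pattern, response in RULES:
        if re.search(pattern, message.lower()):
            return response
    return FALLBACK

print(reply("Hi there"))  # prints "Hello! How can I help you today?"
```

Modern conversational agents replace the hand-written rules with learned models, but any input outside what the system was built for still tends to produce an off-key reply.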


It is almost impossible to imagine a world without computers translating one language into another. Programs like Google Translate exist for translating nearly any language into nearly any other language. Even Facebook has a “translate” feature: if your friends post in Spanish, you can read it in English. As you have probably noticed, it's not perfect, but it's generally pretty close. This kind of translation works best in fields or with texts where the vocabulary is well known and few idioms are used. For example, a machine translation of a technical manual can work well, but a translation of a novel or short story is often almost comical.

Natural language processing in the form of voice recognition is also everywhere. The iPhone feature “Siri” takes a question you ask and makes enough sense of it to answer adequately most of the time. You can speak to your phone and ask it to call someone for you or give you directions. Another application is when your email program recognizes an event and suggests that you add it to your calendar (a process called information extraction). These features seemed like science fiction at one time but are now part of our everyday lives.
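The calendar-suggestion example can be sketched with a single information-extraction rule. Real systems use trained sequence models over many patterns; the lone “weekday at time” regex below is only an illustration.

```python
import re

# Toy rule: find "<weekday> ... at <time>" in an email and propose an event.
PATTERN = re.compile(
    r"\b(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\b"
    r".{0,20}?\bat\s+(\d{1,2}(?::\d{2})?\s?(?:am|pm))",
    re.IGNORECASE,
)

def extract_event(email_text: str):
    """Return a calendar suggestion if a day-and-time phrase is found."""
    match = PATTERN.search(email_text)
    if match:
        day, time = match.group(1), match.group(2)
        return f"Add to calendar: {day} at {time}"
    return None

print(extract_event("Let's meet on Friday at 3pm to review the draft."))
# prints "Add to calendar: Friday at 3pm"
```

The hard part in practice is the variety of ways humans phrase dates (“next Tues,” “the day after tomorrow”), which is why hand-written rules like this one only scratch the surface.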

—Marianne Moss Madsen, MS
