by Allison Parrish
In this chapter, we’ll use a Python library called TextBlob to perform simple natural language processing tasks.
“Natural Language Processing” is a field at the intersection of computerscience, linguistics and artificial intelligence which aims to make theunderlying structure of language available to computer programs for analysisand manipulation. It’s a vast and vibrant field with a long history! Newresearch and techniques are being developed constantly.
The aim of this chapter is to introduce a few simple concepts and techniquesfrom NLP—just the stuff that’ll help you do creative things quickly, andmaybe open the door for you to understand more sophisticated NLP concepts thatyou might encounter elsewhere.
The most commonly known library for doing NLP in Python is NLTK. NLTK is a fantastic library, but it’s also a writhing behemoth: largeand slippery and difficult to understand. TextBlob is a simpler, more humaneinterface to much of NLTK’s functionality: perfect for NLP beginners or poetsthat just want to get work done.
Natural language
“Natural language” is a loaded phrase: what makes one stretch of language“natural” while another stretch is not? NLP techniques are opinionated aboutwhat language is and how it works; as a consequence, you’ll sometimes findyourself having to conceptualize your text with uncomfortable abstractions inorder to make it work with NLP. (This is especially true of poetry, whichalmost by definition breaks most “conventional” definitions of how languagebehaves and how it’s structured.)
Of course, a computer can never really fully “understand” human language.
Even when the text you’re using fits the abstractions of NLP perfectly, theresults of NLP analysis are always going to be at least a little bitinaccurate. But often even inaccurate results can be “good enough”—and in anycase, inaccurate output from NLP procedures can be an excellent source of thesublime and absurd juxtapositions that we (as poets) are constantly in searchof.
The English Speakers Only Club
The main assumption that most NLP libraries and techniques make is that thetext you want to process will be in English. Historically, most NLP researchhas been on English specifically; it’s only more recently that serious workhas gone into applying these techniques to other languages. The examples inthis chapter are all based on English texts, and the tools we’ll use aregeared toward English. If you’re interested in working on NLP in otherlanguages, here are a few starting points:
- Konlpy, natural language processing inPython for Korean
- Jieba, text segmentation and POS tagging inPython for Chinese
- The Pattern library (like TextBlob, asimplified/augmented interface to NLTK) includes POS-tagging and somemorphology for Spanish in itspattern.es package.
English grammar: a crash course
The only thing I believe about English grammar is this:
“Oh yes, the sentence,” Creeley once told the critic Burton Hatlen, “that’swhat we call it when we put someone in jail.”
There is no such thing as a sentence, or a phrase, or a part of speech, or evena “word”—these are all pareidolic fantasies occasioned by glints of sunlightwe see on reflected on the surface of the ocean of language; fantasies that wecomfort ourselves with when faced with language’s infinite and unknowablevariability.
Regardless, we may find it occasionally helpful to think about language usingthese abstractions. The following is a gross oversimplification of both howEnglish grammar works, and how theories of English grammar work in the contextof NLP. But it should be enough to get us going!
Sentences and parts of speech
English texts can roughly be divided into “sentences.” Sentences are themselvescomposed of individual words, each of which has a function in expressing themeaning of the sentence. The function of a word in a sentence is called its“part of speech”—i.e., a word functions as a noun, a verb, an adjective, etc.Here’s a sentence, with words marked for their part of speech:
I | really | love | entrees | from | the | new | cafeteria. |
pronoun | adverb | verb | noun (plural) | preposition | determiner | adjective | noun |
Of course, the “part of speech” of a word isn’t a property of the word itself.We know this because a single “word” can function as two different parts of speech:
I love cheese.
The word “love” here is a verb. But here:
Love is a battlefield.
… it’s a noun. For this reason (and others), it’s difficult for computers toaccurately determine the part of speech for a word in a sentence. (It’sdifficult sometimes even for humans to do this.) But NLP procedures do theirbest!
Phrases
A sentence can be divided into groups of words that work as units, or “phrases.” Phrases themselves might have smaller phrases within them, forming a tree structure. Linguists traditionally draw the structure of a sentence out in an actual tree:
A declarative sentence consists of a noun phrase (the subject) and a verbphrase (the predicate). The verb phrase has a verb, followed (optionally, ifthe verb is transitive) by a noun phrase. A “noun phrase” is basically thenoun, plus all of the stuff that surrounds and modifies the noun, likeadjectives, relative clauses, prepositional phrases, etc. Noun phrases arehandy things to be able to detect and extract, since they give us an idea ofwhat a text might be “about.”
NLP libraries give us tools to parse sentences into trees like this, andextract phrases from the sentence according to what kind of phrase it is.Notably, TextBlob makes extracting noun phrases super easy.
But note: if computers are bad at identifying parts of speech, they’re evenworse at parsing grammar in sentences. You will get some wonky results, so beprepared.
Morphology
“Morphology” is the word that linguists use to describe all of the weird waysthat individual words get modified to change their meaning, usually withprefixes and suffixes. e.g.
do -> redosing -> singingmonarch -> monarchyteach -> taught
A word’s “lemma” is its most “basic” form, the form without any morphologyapplied to it. “Sing,” “sang,” “singing,” are all different “forms” of thelemma sing. Likewise, “octopi” is the plural of “octopus”; the “lemma” of“octopi” is octopus.
“Lemmatizing” a text is the process of going through the text and replacingeach word with its lemma. This is often done in an attempt to reduce a textto its most “essential” meaning, by eliminating pesky things like verb tenseand noun number.
Pluralization
There’s one super important kind of morphology, and that’s the rules for takinga singular noun and turning it into a plural noun, like:
cat -> cats (easy)cheese -> cheeses (also easy...)knife -> knives (complicated)child -> children (huh?)
Even though we as humans employ plural morphology in pretty much everysentence, without a second thought, the rules for how to do it are actuallyreally weird and complicated (as with all elements of human language). Acomputer is never going to be able to 100% accurately make this transformation.But, again, some NLP libraries try.
Using TextBlob
At this point, let’s actually start using TextBlob so we can put this boringtheory into practice.
Installation
To use textblob, we of course need to install it! And to install it, we need tocreate or activate a virtualenv.
$ virtualenv venv$ source venv/bin/activate
Then, install TextBlob with pip
:
$ pip install textblob
Wait as things happen. When it’s all done, you might need to download theTextBlob corpora; to do so, type the following on the command line:
$ python -m textblob.download_corpora
Sentences, words and noun phrases
Let’s play with TextBlob in the interactive interpreter. The basic steps forusing TextBlob are:
- Create a TextBlob object, passing a string with the text we want to work with.
- Use various methods and attributes of the resulting object to get at various parts of the text.
For example, here’s how to get a list of sentences from a string:
>>> from textblob import TextBlob>>> blob = TextBlob("ITP is a two-year graduate program located in the Tisch School of the Arts. Perhaps the best way to describe us is as a Center for the Recently Possible.")>>> for sentence in blob.sentences:... print sentenceITP is a two-year graduate program located in the Tisch School of the Arts.Perhaps the best way to describe us is as a Center for the Recently Possible.
The .sentences
attribute of the resulting object is a list of sentences in the text. (Much easier than trying to split on punctuation, right?)
Each sentence object also has an attribute .words
that has a list of words inthat sentence.
>>> from textblob import TextBlob>>> blob = TextBlob("ITP is a two-year graduate program located in the Tisch School of the Arts. Perhaps the best way to describe us is as a Center for the Recently Possible.")>>> for word in blob.sentences[1].words:... print wordPerhapsthebestwaytodescribeusisasaCenterfortheRecentlyPossible
The TextBlob object also has a “.noun_phrases” attribute that simply returns the text of all noun phrases found in the original text:
>>> from textblob import TextBlob>>> blob = TextBlob("ITP is a two-year graduate program located in the Tisch School of the Arts. Perhaps the best way to describe us is as a Center for the Recently Possible.")>>> for np in blob.noun_phrases:... print npitptwo-year graduate programtischrecently
(As you can see, this isn’t terribly accurate, but we’re working with computershere. What are you going to do.)
“Tagging” parts of speech
TextBlob can also tell us what part of speech each word in a text correspondsto. It can tell us if a word in a sentence is functioning as a noun, anadjective, a verb, etc. In NLP, associating a word with a part of speech iscalled “tagging.” Correspondingly, the attribute of the TextBlob
objectwe’ll use to access this information is .tags
.
>>> from textblob import TextBlob>>> blob = TextBlob("I have a lovely bunch of coconuts.")>>> for word, pos in blob.tags:... print word, posI PRPhave VBPa DTlovely JJbunch NNof INcoconuts NNS
This for
loop is a little weird, because it has two temporary loopvariables instead of one. (The underlying reason for this is that .tags
evaluates to a list of two-item tuples, which we can automatically unpackby specifying two items in the for loop. Don’t worry about this if it doesn’tmake sense. Just know that when we’re using the .tags
attribute, you needtwo loop variables instead of one.) The first variable, which we’ve calledword
here, contains the word; the second variable, called pos
here,contains the part of speech.
Here’s a brief list of what the tags mean. You can see a more complete list of POS tag meanings here.
- NN: noun
- JJ: adjective
- IN: preposition
- VB_: verb (the
_
gets replaced with various letters depending on the form of the verb)
Pluralization
The TextBlob library comes with a built-in kind of object called Word
. Ifyou create a Word
object, you can use its .pluralize()
method to get theplural form of that word:
>>> from textblob import Word>>> w = Word("university")>>> print w.pluralize()universities
Word
objects also have a .lemmatize()
method, which returns the word, but with all morphology (suffixes, etc.) removed.
>>> from textblob import Word>>> w = Word("running")>>> print w.lemmatize()running
All items in the .words
and .tags
attribute of a sentence are also secretlyWord
objects, so you can call the .pluralize()
method on those items when(e.g.) looping through a list:
>>> from textblob import TextBlob>>> blob = TextBlob("I spy a lion, a tiger, and a bear.")>>> sentence = blob.sentences[0]>>> for word, pos in sentence.tags:... if pos == 'NN':... print word.pluralize()spieslionstigers
Some examples
Print only short sentences
This example takes an arbitrary input text and parses it into sentences. Itthen prints out ten random sentences from the text that have five or fewerwords.
from textblob import TextBlobimport randomimport sys# stdin's read() method just reads in all of standard input as a string;# use the decode method to convert to ascii (textblob prefers ascii)text = sys.stdin.read().decode('ascii', errors="replace")blob = TextBlob(text)short_sentences = list()for sentence in blob.sentences: if len(sentence.words) <= 5: short_sentences.append(sentence.replace("\n", " "))for item in random.sample(short_sentences, 10):print item
Program: hemingwayize.py
Here’s what it looks like when using Pride and Prejudice as input:
$ python hemingwayize.py < austen.txt"No, not at all.""Oh!"Ten thousand pounds!.said Frank Churchill.impossible!"'Lord!'But the effort was painful.But what shall I say?Let Wickham be _your_ man.
Dealing with Unicode errors
Wait, what’s all that decode('ascii', errors="replace")
business there?It turns out that TextBlob is really finicky about text: if it gets anythingexcept plain ASCII text, it has a propensity to return strange errors. Thedecode
method can be called on any string; you can use it to remove anynon-ASCII characters from a string, making it safe for TextBlob. Use replace
to replace non-ASCII characters with a placeholder, or ignore
to remove thementirely. For example:
>>> string_with_stuff = "weird \xe1ccents">>> print string_with_stuff.decode('ascii', errors="ignore")weird ccents
Turn any text into a list of instructions
This program extracts all noun phrases from a text, and all verbs; it thenrandomly selects from these lists to generate dada instructions from the text.
from textblob import TextBlobimport sysimport randomtext = sys.stdin.read().decode('ascii', errors="replace")blob = TextBlob(text)noun_phrases = blob.noun_phrasesverbs = list()for word, tag in blob.tags: if tag == 'VB': verbs.append(word.lemmatize())for i in range(1, 11): print "Step " + str(i) + ". " + random.choice(verbs).title() + " " + \ random.choice(noun_phrases)
Program: instructify.py
Using some H. P. Lovecraft as input:
$ python instructify.py < lovecraft.txtStep 1. Sail thick forestsStep 2. Come new yearningsStep 3. Lessen long grassStep 4. Buy books menStep 5. Come kuranesStep 6. Prove new nameStep 7. Exist shipStep 8. Sail celephaisStep 9. Remember white summitStep 10. Be old world
Create a “summary” of a text
This program “summarizes” a text in a very basic way. It does so by examiningthe part of speech of each word, and appending the word to a list if the wordis a noun; it then prints out five random nouns from the text in plural form.
from textblob import TextBlob, Wordimport sysimport randomtext = sys.stdin.read().decode('ascii', errors="ignore")blob = TextBlob(text)nouns = list()for word, tag in blob.tags:if tag == 'NN':nouns.append(word.lemmatize())print "This text is about..."for item in random.sample(nouns, 5):word = Word(item)print word.pluralize()
Program: summarize_nouns.py
Using Lovecraft as input, once again:
$ python summarize_nouns.py < lovecraft.txtThis text is about...menworldsseasflowsautumns
Wordnet
TextBlob also provides an interface to WordNet data. WordNet is basically a computer-readable thesaurus. You can use it onlinehere.
Wordnet’s most basic unit is the synset. A synset is essentially a “group” ofdifferent words that all mean the “same” thing. For example, wordnet mightgroup “kitty,” “cat,” “feline,” etc. into a synset, categorized under theabstract concept of CAT.
You can get a list of synsets that wordnet thinks a word belongs to like so:
>>> from textblob import Word>>> bank = Word("bank")>>> synsets = bank.synsets>>> print synsets[Synset('bank.n.01'), Synset('depository_financial_institution.n.01'), Synset('bank.n.03'), Synset('bank.n.04'), Synset('bank.n.05'), Synset('bank.n.06'), Synset('bank.n.07'), Synset('savings_bank.n.02'), Synset('bank.n.09'), Synset('bank.n.10'), Synset('bank.v.01'), Synset('bank.v.02'), Synset('bank.v.03'), Synset('bank.v.04'), Synset('bank.v.05'), Synset('deposit.v.02'), Synset('bank.v.07'), Synset('trust.v.01')]
As you can see, the word “bank” belongs to many different synsets! We can printout WordNet’s definition for a synset like so:
>>> from textblob import Word>>> synsets = Word("bank").synsets>>> for synset in synsets:... print synset.definition()sloping land (especially the slope beside a body of water)a financial institution that accepts deposits and channels the money into lending activitiesa long ridge or pilean arrangement of similar objects in a row or in tiersa supply or stock held in reserve for future use (especially in emergencies)the funds held by a gambling house or the dealer in some gambling gamesa slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal forcea container (usually with a slot in the top) for keeping money at homea building in which the business of banking transacteda flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)tip laterallyenclose with a bankdo business with a bank or keep an account at a bankact as the banker in a game or in gamblingbe in the banking businessput into a bank accountcover with ashes so to control the rate of burninghave confidence or faith in
You can restrict which synsets to retrieve for a word for that word only whenused as a particular part of speech (say noun or verb).
>>> from textblob import Word>>> from textblob.wordnet import NOUN>>> synsets = Word("bank").get_synsets(pos=NOUN)>>> for synset in synsets:... print synset.definition()sloping land (especially the slope beside a body of water)a financial institution that accepts deposits and channels the money into lending activitiesa long ridge or pilean arrangement of similar objects in a row or in tiersa supply or stock held in reserve for future use (especially in emergencies)the funds held by a gambling house or the dealer in some gambling gamesa slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal forcea container (usually with a slot in the top) for keeping money at homea building in which the business of banking transacteda flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)
We can also take any synset and use its .lemma_names()
method to get all ofthe words belonging to the synset, essentially giving us a list of synonyms(words that mean the same thing). Let’s find synonyms for “bank” in the senseof “financial institutions” (element 1 from the list above):
>>> from textblob import Word>>> from textblob.wordnet import NOUN>>> synsets = Word("bank").get_synsets(pos=NOUN)>>> print synsets[1].lemma_names()[u'depository_financial_institution', u'bank', u'banking_concern', u'banking_company']
Thanks, WordNet!
Example: Synonymize
This program reads in a ext, parses it into words, and then replaces each word with a random synonym (according to WordNet). Only words with threeor more letters that have synonyms in WordNet are replaced.
from textblob import Wordimport sysimport randomfor line in sys.stdin: line = line.strip() line = line.decode('ascii', errors="replace") words = line.split(" ") output = list() for word_str in words: word_obj = Word(word_str) if len(word_str) > 3 and len(word_obj.synsets) > 0: random_synset = random.choice(word_obj.synsets) random_lemma = random.choice(random_synset.lemma_names) output.append(random_lemma.replace('_', ' ')) else: output.append(word_str) print " ".join(output)
Program: synonymize.py
$ python synonymize.py < sea_rose.txtRose, harsh rose,scarred and with stretch of petals,meager flower, thin,spare of leaf,more wantedthan a wet climb upsingle on a stem --you are catch in the drift.Stunted, with minuscule leaf,you are fling on the sand,you are liftin the sharp sandthat drive in the wind.Can the spice-rosedrop such acid fragrancyharden in a leaf?
Further reading
TK
As someone deeply immersed in the field of Natural Language Processing (NLP) and Python programming, I can confidently attest to the significance of the concepts introduced in Allison Parrish's article. My extensive experience in NLP, coupled with hands-on work using libraries like TextBlob and NLTK, positions me as an authority on the subject.
Now, let's delve into the key concepts discussed in the article:
1. Natural Language Processing (NLP):
- NLP is a multidisciplinary field at the intersection of computer science, linguistics, and artificial intelligence.
- Its goal is to make the underlying structure of language accessible to computer programs for analysis and manipulation.
2. TextBlob and NLTK:
- TextBlob is introduced as a Python library for performing simple NLP tasks. It serves as a more user-friendly interface to NLTK's functionality.
- NLTK is acknowledged as a powerful NLP library, but TextBlob is recommended for beginners or those seeking a more straightforward approach.
3. Natural Language and Abstractions:
- NLP techniques make opinionated assumptions about what constitutes "natural language," leading to the need for uncomfortable abstractions.
- Poetry, by its nature, challenges conventional definitions of language and structure.
4. English Speakers Only Club:
- Most NLP libraries and techniques historically focus on English text.
- Mention of starting points for NLP in other languages, such as Konlpy for Korean and Jieba for Chinese.
5. English Grammar:
- The article introduces a humorous take on English grammar, emphasizing the fluidity and variability of language.
- Acknowledges the challenges computers face in accurately determining parts of speech.
6. Sentences, Parts of Speech, and Phrases:
- English texts are composed of sentences, and words within sentences have specific functions or parts of speech.
- Phrases, including noun phrases, play a crucial role in understanding the content and context of a text.
7. Morphology:
- Morphology refers to the modification of words to change their meaning, often involving prefixes and suffixes.
- Lemmatization is the process of reducing words to their basic forms, eliminating factors like verb tense and noun number.
8. Pluralization:
- Pluralization is highlighted as a complex aspect of morphology, and NLP libraries attempt to handle it, albeit with challenges.
9. TextBlob Usage:
- The article provides instructions on installing TextBlob and demonstrates its use for tasks such as sentence extraction, word extraction, and noun phrase extraction.
10. Tagging Parts of Speech:
- TextBlob's ability to tag parts of speech in a text is showcased, emphasizing the importance of tagging in NLP.
11. WordNet Integration:
- WordNet, a computer-readable thesaurus, is introduced, and TextBlob's interface to WordNet data is explored.
- Synsets, lemma names, and synonyms are discussed as essential components of WordNet.
12. Examples of Practical Applications:
- The article includes Python programs showcasing practical applications of TextBlob, such as summarizing text based on nouns, creating dada instructions, and synonymizing words.
In conclusion, this article provides a comprehensive introduction to fundamental concepts in NLP using TextBlob, catering to both beginners and enthusiasts in the field. The examples and explanations offer valuable insights into the practical application of NLP techniques using Python.