realcode4you
- Feb 20, 2022

Text Processing Using NLP, POS tagging, NER, Dependency parsing, Constituency parsing

Commands to create and set up a virtual environment

conda create --name <env_name>
conda install jupyter
conda install ipykernel
python -m ipykernel install --user --name <env_name>

Install packages

pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
pip install stanza pandas sqlalchemy psycopg2-binary

Text Processing - Stanza

import stanza
import time
import pandas as pd
import sqlalchemy as sal
import psycopg2
from sqlalchemy import text

Quick start


nlp = stanza.Pipeline('en') # This sets up a default neural pipeline in English

raw_doc = '''The Company is focused on expanding its market opportunities related to smartphones, personal computers, tablets, wearables\nand accessories, and services. The Company faces substantial competition in these markets from companies that have\nsignificant technical, marketing, distribution and other resources, as well as established hardware, software, and service offerings\nwith large customer bases. In addition, some of the Company’s competitors have broader product lines, lower-priced products\nand a larger installed base of active devices. Competition has been particularly intense as competitors have aggressively cut\nprices and lowered product margins. Certain competitors have the resources, experience or cost structures to provide products\nat little or no profit or even at a loss. The Company’s services compete with business models that provide content to users for\nfree and use illegitimate means to obtain third-party digital content and applications. The Company faces significant competition\nas competitors imitate the Company’s product features and applications within their products, or collaborate to offer integrated\nsolutions that are more competitive than those they currently offer'''

doc = nlp(raw_doc)
doc
doc[0]
type(doc)
doc.to_dict()[0][3]
doc.sentences

Pipeline building

To start annotating text with Stanza, you would typically start by building a Pipeline that contains Processors, each fulfilling a specific NLP task you desire (e.g., tokenization, part-of-speech tagging, syntactic parsing, etc). The pipeline takes in raw text or a Document object that contains partial annotations, runs the specified processors in succession, and returns an annotated Document. To build and customize the pipeline, you can specify the options: https://stanfordnlp.github.io/stanza/pipeline.html#pipeline

For processors: https://stanfordnlp.github.io/stanza/pipeline.html#processors
For data objects: https://stanfordnlp.github.io/stanza/data_objects.html#word
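To make the idea of processors running in succession concrete, here is a toy sketch in plain Python. It does not use Stanza at all: the names `run_pipeline`, `tokenize` and `lowercase_lemma` are hypothetical stand-ins, and the "document" is just a dictionary. It only illustrates that a pipeline can accept raw text or a partially annotated document and pass it through each step in order.

```python
# Toy illustration only: none of these names belong to Stanza.
from typing import Any, Callable, Dict, List, Union

Document = Dict[str, Any]
Processor = Callable[[Document], Document]


def tokenize(doc: Document) -> Document:
    # Split on whitespace; a real tokenizer is far more careful.
    doc["tokens"] = doc["text"].split()
    return doc


def lowercase_lemma(doc: Document) -> Document:
    # Stand-in for lemmatization: lower-case every token.
    doc["lemmas"] = [token.lower() for token in doc["tokens"]]
    return doc


def run_pipeline(data: Union[str, Document], processors: List[Processor]) -> Document:
    # Accept raw text or a document that already carries some annotations.
    doc: Document = {"text": data} if isinstance(data, str) else dict(data)
    for processor in processors:
        doc = processor(doc)  # each processor adds its own annotation layer
    return doc


annotated = run_pipeline("The Company faces competition", [tokenize, lowercase_lemma])
print(annotated["tokens"])
print(annotated["lemmas"])
```

Because each step only reads what earlier steps produced, a document that already has tokens can be passed straight to the later steps.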

# Build the pipeline with tokenization, POS tagging and lemmatization;
# pos_batch_size sets the part-of-speech processor's batch size
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma', use_gpu=False, pos_batch_size=3000)
doc = nlp(raw_doc) # Run the pipeline on the input text

for sentence in doc.sentences:
#     print(sentence.entities)
#     print(sentence.dependencies)
    for word in sentence.words:
        print(word.text, word.lemma, word.pos)

Text Processing - spaCy

https://spacy.io/
data types: https://spacy.io/api/doc
default pipeline: https://spacy.io/models

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(raw_doc)
type(doc)

for token in doc:
    print(token.text, token.lemma_)

# Find named entities
for entity in doc.ents:
    print(entity.text, entity.label_)

Modify default pipeline: https://spacy.io/usage/processing-pipelines#disabling


# Load the pipeline without the entity recognizer
nlp = spacy.load("en_core_web_sm", exclude=["ner"])

# Load the tagger and parser but don't enable them
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
# Explicitly enable the tagger later on
# nlp.enable_pipe("tagger")

analysis = nlp.analyze_pipes(pretty=True)


Process large amounts of data

When processing large volumes of text, the statistical models are usually more efficient if you let them work on batches of texts. spaCy’s nlp.pipe method takes an iterable of texts and yields processed Doc objects. The batching is done internally.
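The idea of internal batching can be sketched without spaCy. The generator below is a toy illustration under stated assumptions, not how nlp.pipe is implemented: `toy_pipe` and `count_words` are made-up names, and the "model" is just a word counter. It shows the shape of the technique: texts are consumed lazily, grouped into batches, processed one batch per call, and the results are yielded one at a time.

```python
# Toy illustration only: toy_pipe and count_words are hypothetical names.
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def toy_pipe(texts: Iterable[str],
             process_batch: Callable[[List[str]], List[int]],
             batch_size: int = 2) -> Iterator[int]:
    """Consume texts lazily, process them batch by batch, yield results one by one."""
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        # One call handles the whole batch; this is where a real model saves time.
        yield from process_batch(batch)


def count_words(batch: List[str]) -> List[int]:
    # Stand-in for a statistical model: word count of each text in the batch.
    return [len(text.split()) for text in batch]


results = list(toy_pipe(["a b c", "d e", "f"], count_words, batch_size=2))
print(results)  # [3, 2, 1]
```

The caller never sees the batches, only a stream of results in the original order.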


texts = ['''The markets for the Company’s products and services are highly competitive, and are characterized by aggressive price
competition and resulting downward pressure on gross margins, frequent introduction of new products and services, short
product life cycles, evolving industry standards, continual improvement in product price and performance characteristics, rapid
adoption of technological advancements by competitors, and price sensitivity on the part of consumers and businesses. Many of
the Company’s competitors seek to compete primarily through aggressive pricing and very low cost structures, and by imitating
the Company’s products and infringing on its intellectual property.''', '''The Company’s ability to compete successfully depends heavily on ensuring the continuing and timely introduction of innovative
new products, services and technologies to the marketplace. The Company designs and develops nearly the entire solution for
its products, including the hardware, operating system, numerous software applications and related services. Principal
competitive factors important to the Company include price, product and service features (including security features), relative
price and performance, product and service quality and reliability, design innovation, a strong third-party software and
accessories ecosystem, marketing and distribution capability, service and support, and corporate reputation.''', '''The Company is focused on expanding its market opportunities related to smartphones, personal computers, tablets, wearables
and accessories, and services. The Company faces substantial competition in these markets from companies that have
significant technical, marketing, distribution and other resources, as well as established hardware, software, and service offerings
with large customer bases. In addition, some of the Company’s competitors have broader product lines, lower-priced products
and a larger installed base of active devices. Competition has been particularly intense as competitors have aggressively cut
prices and lowered product margins. Certain competitors have the resources, experience or cost structures to provide products
at little or no profit or even at a loss. The Company’s services compete with business models that provide content to users for
free and use illegitimate means to obtain third-party digital content and applications. The Company faces significant competition
as competitors imitate the Company’s product features and applications within their products, or collaborate to offer integrated
solutions that are more competitive than those they currently offer''']
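start = time.time()
docs1 = [nlp(text) for text in texts]  # process the texts one at a time
end1 = time.time()
docs2 = list(nlp.pipe(texts))  # process the texts in batches
end = time.time()

end1 - start  # seconds taken by one-at-a-time processing
end - end1  # seconds taken by nlp.pipe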

Output:

0.0818321704864502
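A single wall-clock measurement on three short texts can be noisy. As a hedged sketch, the helper below repeats a measurement several times with time.perf_counter and keeps the fastest run. The name `best_of` is hypothetical, and the workload here is a stand-in so the block runs without spaCy; in the notebook you would pass the two approaches as callables instead.

```python
# Sketch of a repeat-and-take-the-minimum timer; best_of is a hypothetical helper.
import time
from typing import Callable


def best_of(func: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time, in seconds, over several runs of func."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload. With spaCy loaded, the two callables would be
#   lambda: [nlp(text) for text in texts]
#   lambda: list(nlp.pipe(texts))
elapsed = best_of(lambda: sum(range(10_000)))
print(f"{elapsed:.6f} seconds")
```

Taking the minimum rather than the mean reduces the influence of one-off delays such as other processes competing for the CPU.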

AutoPhrase - store news to file


https://github.com/shangjingbo1226/AutoPhrase

engine = sal.create_engine('postgresql+psycopg2://ag_class:U2h]mkc@awesome-hw.sdsc.edu/postgres')

conn = engine.connect()

sql = text('''select count(*) from usnewspaper where src ilike '%wsj%' ''')
result = conn.execute(sql) 

result.fetchone()

(288150,)


sql = text('''select news from usnewspaper where src ilike '%wsj%' limit 50000''')
result = conn.execute(sql) 


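# Write one article per line; the context manager closes the file automatically
with open("/Users/xiuwenzheng/Documents/mgtf/AutoPhrase/data/EN/wsjnews.txt", "w") as file:
    for news in result:
        # Replace literal "\n" and "\t" escape sequences with spaces
        news = news[0].replace('\\n', ' ')
        news = news.replace('\\t', ' ')
        file.write(news)
        file.write("\n")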
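One detail in the cleaning step is easy to misread: in Python source, '\\n' is the two-character sequence backslash plus n, not a real newline. The small sketch below isolates that step in a function so the difference is visible. The function name `flatten_article` and the sample string are made up for illustration.

```python
# flatten_article is a hypothetical helper that mirrors the two replace calls.
def flatten_article(article: str) -> str:
    # '\\n' and '\\t' match a literal backslash followed by n or t,
    # not actual newline or tab characters.
    return article.replace('\\n', ' ').replace('\\t', ' ')


sample = 'First line\\nSecond\\tcolumn'
print(flatten_article(sample))  # First line Second column

# A real newline character is left untouched by this function.
print(flatten_article('a\nb') == 'a\nb')  # True
```

If the stored articles also contain real newline characters, those would need a separate replace call to keep one article per line.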

POS tagging, NER, Dependency parsing, constituency parsing


https://stanfordnlp.github.io/stanza/depparse.html

nlp = stanza.Pipeline(processors='tokenize,pos,lemma,depparse')
doc = nlp(raw_doc)


print(*[f'word: {word.text}\tupos: {word.upos}\txpos: {word.xpos}\tfeats: {word.feats if word.feats else "_"}' 
        for sent in doc.sentences 
        for word in sent.words], sep='\n')


print(*[f'id: {word.id}\tword: {word.text}\thead id: {word.head}\thead: {sent.words[word.head-1].text if word.head > 0 else "root"}\tdeprel: {word.deprel}' 
        for sent in doc.sentences for word in sent.words], sep='\n')
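The second print statement relies on the indexing convention for heads: word ids start at 1, and a head of 0 marks the root. The sketch below shows that convention on a hand-written sentence. The word dictionaries are illustrative stand-ins rather than actual Stanza output, and `head_text` and `dependents_by_head` are hypothetical helper names.

```python
# Hand-written stand-in for one parsed sentence, not real parser output.
from collections import defaultdict

words = [
    {"id": 1, "text": "The", "head": 2, "deprel": "det"},
    {"id": 2, "text": "Company", "head": 3, "deprel": "nsubj"},
    {"id": 3, "text": "faces", "head": 0, "deprel": "root"},
    {"id": 4, "text": "competition", "head": 3, "deprel": "obj"},
]


def head_text(word, sentence_words):
    # head is a 1-based id, so subtract 1 to index the list; 0 means root.
    return sentence_words[word["head"] - 1]["text"] if word["head"] > 0 else "root"


def dependents_by_head(sentence_words):
    # Group each word's text under the id of its head.
    grouped = defaultdict(list)
    for word in sentence_words:
        grouped[word["head"]].append(word["text"])
    return dict(grouped)


for word in words:
    print(word["id"], word["text"], "->", head_text(word, words), word["deprel"])

print(dependents_by_head(words))
```

Grouping by head id is a simple way to walk the tree top-down, starting from the entry under 0.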

raw_doc

output:

('The Company’s ability to compete successfully depends heavily on ensuring the continuing and timely introduction of innovative\nnew products, services and technologies to the marketplace. The Company designs and develops nearly the entire solution for\nits products, including the hardware, operating system, numerous software applications and related services. Principal\ncompetitive factors important to the Company include price, product and service features (including security features), relative\nprice and performance, product and service quality and reliability, design innovation, a strong third-party software and\naccessories ecosystem, marketing and distribution capability, service and support, and corporate reputation.',
 'The Company is focused on expanding its market opportunities related to smartphones, personal computers, tablets, wearables\nand accessories, and services. The Company faces substantial competition in these markets from companies that have\nsignificant technical, marketing, distribution and other resources, as well as established hardware, software, and service offerings\nwith large customer bases. In addition, some of the Company’s competitors have broader product lines, lower-priced products\nand a larger installed base of active devices. Competition has been particularly intense as competitors have aggressively cut\nprices and lowered product margins. Certain competitors have the resources, experience or cost structures to provide products\nat little or no profit or even at a loss. The Company’s services compete with business models that provide content to users for\nfree and use illegitimate means to obtain third-party digital content and applications. The Company faces significant competition\nas competitors imitate the Company’s product features and applications within their products, or collaborate to offer integrated\nsolutions that are more competitive than those they currently offer')

nlp = stanza.Pipeline(processors='tokenize,pos,lemma,constituency,ner')  # constituency is required for sentence.constituency
doc = nlp(raw_doc)


tree = doc.sentences[0].constituency
tree.label
tree.children

output:

[(S (NP (DT The) (NN Company) (POS ’s) (NNS customers)) (VP (VBP are) (ADVP (RB primarily)) (PP (IN in) (NP (DT the) (NML (NML (NN consumer)) (, ,) (ADJP (JJ small)) (CC and) (NML (JJ mid-sized) (NN business)) (, ,) (NML (NN education)) (, ,) (NML (NN enterprise)) (CC and) (NML (NN government))) (NNS markets)))) (. .))]
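The bracketed output can also be read by hand: every "(" opens a node, the first item after it is the label, and the rest are children or words. As a rough sketch, the parser below turns such a string into nested tuples and collects the words at the leaves. It is plain Python, independent of Stanza's own tree object (which already exposes label and children, as used above); `parse_bracketed` and `leaves` are hypothetical names, and the input is the first noun phrase copied from the output.

```python
# Minimal bracketed-tree reader; parse_bracketed and leaves are hypothetical names.
import re


def parse_bracketed(text):
    """Parse '(LABEL child ...)' into (label, [children]); words stay plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    position = 0

    def parse_node():
        nonlocal position
        if tokens[position] != "(":
            raise ValueError("expected '('")
        position += 1
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(parse_node())  # nested constituent
            else:
                children.append(tokens[position])  # a word at the leaf
                position += 1
        position += 1  # consume the closing ")"
        return (label, children)

    return parse_node()


def leaves(node):
    # Collect the words in left-to-right order.
    _, children = node
    found = []
    for child in children:
        if isinstance(child, str):
            found.append(child)
        else:
            found.extend(leaves(child))
    return found


tree = parse_bracketed("(NP (DT The) (NN Company) (POS ’s) (NNS customers))")
print(tree[0])
print(leaves(tree))
```

The sketch assumes well-formed input with balanced brackets and does no error recovery.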

print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')

output:

entity: Company	type: ORG
entity: Company	type: ORG
entity: Company	type: ORG
entity: Company	type: ORG
entity: third	type: ORDINAL
entity: Company	type: ORG
entity: Company	type: ORG
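A quick way to summarize a list like this is to count the entity types. The sketch below does that with collections.Counter on the pairs copied from the output above, so it runs without Stanza; with a processed document, the same count could be built from the `type` attribute of each item in doc.ents.

```python
# Entity (text, type) pairs copied from the output above.
from collections import Counter

entities = [
    ("Company", "ORG"),
    ("Company", "ORG"),
    ("Company", "ORG"),
    ("Company", "ORG"),
    ("third", "ORDINAL"),
    ("Company", "ORG"),
    ("Company", "ORG"),
]

# Count how often each entity type occurs.
type_counts = Counter(entity_type for _, entity_type in entities)
print(dict(type_counts))  # {'ORG': 6, 'ORDINAL': 1}
```

Note that "third" is picked up from "third-party" and tagged ORDINAL, which is the kind of false positive a count like this makes easy to spot.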

If you have trouble writing code for NLP text processing, share your requirement details at:

realcode4you@gmail.com

and get instant help at an affordable price.
