Getting Started with Natural Language Processing, with Implementation

This article comprises some handy code snippets that we all require as developers, like scraping text from docx and pdf files, tokenization, string handling, and docx file comparison.
Tejas Khare
May 06, 2021

In the previous article, Introduction to Natural Language Processing, we saw what NLP is, where it is used, and some approaches to how it is used. In this article, we will implement some very useful code snippets that come in handy whenever you are dealing with NLP.


To train your machine learning or deep learning model, the first step is to collect the required and relevant data. So if you are dealing with NLP, I assume you either have your data in some text format or you need to extract it yourself. Hence, one of the main highlights of this article is how to extract data from different file formats. Secondly, you will often need to tokenize your sentences or paragraphs in order to vectorize them, so we will also talk about tokenization, along with the useful string functions you need to clean your data. And finally, a real-life example of how to compare two docx documents and identify where a change was made.

Contents of the article

  1. Extracting data from a pdf file
  2. Extracting data from a docx file
  3. Tokenization
  4. A handful of string operations for cleaning your data
  5. Comparing two Docx files

Note: All the result images are taken from the author's own Jupyter notebook output.

1. Extracting data from a pdf file

There's a high chance that the data you require is in a pdf file. You cannot open the file using simple file operations in Python, but there are some libraries you can use. I have mainly used PyMuPDF to parse a pdf file -


!pip install PyMuPDF
import fitz

def pdf_parser(doc_path):
    """
    This function will help to parse through a PDF.
    We have to pass the path of the PDF file and it will return the extracted spans
    from the PDF in the form of a list of dictionaries.
    :args: doc_path - path of the pdf file
    :return: list of span dictionaries containing these keys - 'size', 'flags', 'font', 'color',
             'ascender', 'descender', 'text', 'origin', 'bbox'.
    """
    doc = fitz.open(doc_path)
    blocks_extract = []
    for page in doc:
        blocks = page.getText("dict")["blocks"]
        for b in blocks:
            if b['type'] == 0:  # type 0 means a text block
                for l in b["lines"]:  # iterate through the text lines
                    for s in l["spans"]:  # iterate through the text spans
                        blocks_extract.append(s)
    # Keep only the spans whose text is longer than 3 characters
    data = [span for span in blocks_extract if len(span['text']) > 3]
    return data
# The sample.pdf document contains some information about a medicine 
list_pdf = pdf_parser('sample.pdf')

Now, we only need the 'text' key. You can explore the other keys if you want to.

# Prints a string output for every extracted span
for i in list_pdf:
    print(i['text'])

From here, you can apply your own text pre-processing rules and obtain your required format.

Note: This pdf parser may not work with scanned pdf documents.
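For instance, once you have the spans you can stitch the 'text' values into one string before applying your cleaning rules. The spans list below is a hypothetical stand-in for the parser's output, showing only the 'text' key:

```python
# Hypothetical spans standing in for pdf_parser() output (only the 'text' key shown)
spans = [{'text': 'Paracetamol 500 mg'}, {'text': 'Take one tablet daily'}]

# Join the span texts into a single document string for further pre-processing
full_text = " ".join(span['text'] for span in spans)
print(full_text)  # Paracetamol 500 mg Take one tablet daily
```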

2. Extracting text from a docx file

Besides pdf files, docx files are also a common source of text data to be extracted, hence I have also included this code in the article.


!pip install python-docx
import docx

def getText(filename):
    """
    This function parses docx files and returns a list of strings.
    args: path of the file
    returns: list of strings containing the text, one per paragraph
    """
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        # Making sure that an empty string does not get appended to the list
        if len(para.text) > 3:
            fullText.append(para.text)
    return fullText

text = getText('sample_doc.docx')

3. Tokenization

Tokenization is basically splitting up sentences or paragraphs into individual words or sentences respectively, which are called tokens. We can do this with the simple string method string.split(' '), but this won't work for tokenizing paragraphs into sentences. Hence we can use nltk to do so.
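To see why the plain split is not enough, notice that it leaves the punctuation glued to the words:

```python
s = "The pyramids are in Egypt. They are ancient."

# Splitting on spaces keeps "Egypt." and "ancient." as single tokens
tokens = s.split(' ')
print(tokens)
# ['The', 'pyramids', 'are', 'in', 'Egypt.', 'They', 'are', 'ancient.']
```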


1. wordpunct_tokenize

from nltk.tokenize import wordpunct_tokenize

string1 = "The Egyptian pyramids are ancient pyramid-shaped masonry structures located in Egypt."
tok_string = wordpunct_tokenize(string1)
print(tok_string)

As you can see, the callable function wordpunct_tokenize splits the string into words, and if you look closely, the dash "-" and the full stop "." come out as separate tokens. In some cases you might need them, hence it is better to use wordpunct_tokenize(). You can always replace an unwanted token with an empty string or just remove the element from the list.
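As a quick sketch of that cleanup, here is the token list from above (hard-coded so the snippet stands alone) with the punctuation-only tokens filtered out:

```python
# Tokens as produced by wordpunct_tokenize on string1
tokens = ['The', 'Egyptian', 'pyramids', 'are', 'ancient', 'pyramid', '-', 'shaped',
          'masonry', 'structures', 'located', 'in', 'Egypt', '.']

# Keep only the tokens made up of alphanumeric characters
words = [t for t in tokens if t.isalnum()]
print(words)
# ['The', 'Egyptian', 'pyramids', 'are', 'ancient', 'pyramid', 'shaped', 'masonry', 'structures', 'located', 'in', 'Egypt']
```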

2. sent_tokenize

from nltk.tokenize import sent_tokenize

string2 = "The Egyptian pyramids are ancient pyramid-shaped masonry structures located in Egypt. As of November 2008, sources cite either 118 or 138 as the number of identified Egyptian pyramids. Most were built as tombs for the country's pharaohs and their consorts during the Old and Middle Kingdom periods."
tok_string2 = sent_tokenize(string2)
print(tok_string2)

As you can now see, the difference between wordpunct_tokenize and sent_tokenize is that the latter splits paragraphs into sentences.

4. A handful of string operations for cleaning your data

These are some very handy string functions to clean, preprocess, and add some final touches to your data.

s.find(t)        # index of first instance of string t inside s (-1 if not found)
s.rfind(t)       # index of last instance of string t inside s (-1 if not found)
s.index(t)       # like s.find(t) except it raises ValueError if not found
s.rindex(t)      # like s.rfind(t) except it raises ValueError if not found
s.join(text)     # combine the words of the text into a string using s as the glue
s.split(t)       # split s into a list wherever a t is found (whitespace by default)
s.splitlines()   # split s into a list of strings, one per line
s.lower()        # a lowercased version of the string s
s.upper()        # an uppercased version of the string s
s.title()        # a titlecased version of the string s
s.strip()        # a copy of s without leading or trailing whitespace
s.replace(t, u)  # replace instances of t with u inside s
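A few of these in action on a throwaway string:

```python
s = "Natural Language Processing"

print(s.lower())                             # natural language processing
print(s.find("Language"))                    # 8
print(s.replace("Processing", "Pipelines"))  # Natural Language Pipelines
print("-".join(["clean", "your", "data"]))   # clean-your-data
print("  padded text  ".strip())             # padded text
```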

5. Comparing two docx files

In this application, we will be comparing two docx files. To elaborate, we want the exact sentence in which the change was made. We also want to know if anything was deleted in the new file.

We will be using <mark>string</mark> tags to indicate where the changes or additions were made and <del>string</del> tags to indicate which sentence was deleted.

import re
from nltk.tokenize import sent_tokenize, word_tokenize
import docx
from docx import Document

# We will use a pre-trained Word2Vec model
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
corpus = api.load('text8')
model = Word2Vec(corpus)

def pre_processing(data):
    """
    This function is called on the data to create its vectors. The data should be
    in the form of a list with all the text.
    args: data - data for which we want the vectors (list of strings)
    returns: sentence vectors of the data
    """
    vec_data = []
    for i in range(len(data)):
        text = re.sub('[^A-Za-z0-9]+', ' ', data[i].lower())
        text = word_tokenize(text)
        if len(text) > 0:
            # Average the Word2Vec vectors of the in-vocabulary words
            # to get a single sentence vector
            word_vecs = [model.wv[w] for w in text if w in model.wv]
            if word_vecs:
                vec_data.append(sum(word_vecs) / len(word_vecs))
    return vec_data
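Note that compare() below calls a similarity() helper that is not shown in the article. Here is a minimal sketch of what it could look like, assuming (from how sim is used) that it returns (changed_index, original_index, cosine score) triples for every pair of sentence vectors:

```python
import numpy as np

def similarity(changed_vec, original_vec):
    # Compare every changed-sentence vector with every original-sentence vector
    # and return (changed_index, original_index, cosine_score) triples
    sims = []
    for i, c in enumerate(changed_vec):
        for j, o in enumerate(original_vec):
            score = float(np.dot(c, o) / (np.linalg.norm(c) * np.linalg.norm(o)))
            sims.append((i, j, score))
    return sims

print(similarity([np.array([1.0, 0.0])], [np.array([0.0, 1.0])]))
# [(0, 0, 0.0)]
```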

def compare(ori, upd):
    """
    This function compares two docx files and gives the exact location where the changes
    were made. The changes can be updates, additions or deletions.
    args: two lists of paragraphs from getText to be compared
    returns: ori - document with the locations of changes marked with the respective tags
             change_sent - changed or added sentences
             deleted - deleted sentences from the document
    """
    unchanged = ori[:]
    changed = []
    change_sent = []
    deleted = []
    # loop over each element, which is a paragraph as one string
    for i in range(len(ori)):
        # sentence tokenization of each paragraph
        new_ori = sent_tokenize(ori[i])
        new_upd = sent_tokenize(upd[i])

        upd_set = set(new_upd)
        ori_set = set(new_ori)
        # taking the difference of the two sets
        old_added = ori_set - upd_set  # original sentences, used with the similarity check to find deletions
        changes = upd_set - ori_set    # changed or added sentences

        original = []
        index_ori = None
        # remember which sentences only exist in the original and where they sit
        for line in new_ori:
            if line in old_added:
                index_ori = new_ori.index(line)
                original.append(line)

        # add <mark> around changed or added sentences after getting their index
        for line in new_upd:
            if line in changes:
                index1 = new_upd.index(line)
                new_upd[index1] = "<mark>" + new_upd[index1] + "</mark>"
                changed.append(line)
                change_sent.append(line)
                ori[i] = ",".join(new_upd)

        # Get deleted sentences from the similarity of the original and changed sentences.
        # The similarity will definitely be less than 0.9 for sentences that were
        # deleted. Get the index of those sentences and add the <del> tag.
        if index_ori is not None:
            if len(changed) == 0:
                # nothing was changed in this paragraph, so the leftover
                # original sentences must have been deleted
                for line in original:
                    new_ori[index_ori] = "<del>" + new_ori[index_ori] + "</del>"
                    deleted.append(line)
                    ori[i] = ",".join(new_ori)
            else:
                changed_vec = pre_processing(changed)
                original_vec = pre_processing(original)
                sim = similarity(changed_vec, original_vec)  # helper returning (i, j, score) triples
                for j in sim:
                    if j[2] < 0.9:
                        new_ori[index_ori] = "<del>" + new_ori[index_ori] + "</del>"
                        deleted.append(original[j[1]])
                        ori[i] = ",".join(new_ori)
    if len(changed) == 0 and len(deleted) == 0:
        return unchanged, None, None
    return ori, change_sent, deleted

Now let's get the two docx files using the same getText() function defined earlier in the article.

ori = getText('Orignal.docx')
upd = getText('Updated.docx')

Here is the Orignal document after getText() -

Here is the Updated document after getText() -

change_doc, change_sent, deleted = compare(ori, upd)

Here is what change_doc looks like. I have selected the sentences (highlighted in blue) to make it easier for you to locate the output. I made these respective changes in the document -

  1. Added the sentence - "The Pyramids are not in Australia."
  2. Changed 2.6 million to 26000 million in the sentence - "The base was measured to be about.......which includes an internal hillock."
  3. Deleted the sentence - "The outside layers were bound together by mortar."

You can see the <mark> and <del> tags are added at their accurate locations. You can also check the individual changes and deletions by printing change_sent and deleted.

Note: The docx parser can work incorrectly if there are any unintentionally placed spaces in the document. The input to the compare() function will then be noisy and give incorrect results. For the best results, the formatting of the original and updated documents should be the same.
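One cheap safeguard against such stray spaces is to normalize whitespace in both paragraph lists before calling compare(). This normalize() helper is my own addition, not part of the original pipeline:

```python
import re

def normalize(paragraphs):
    # Collapse runs of whitespace and trim the ends so accidental
    # double spaces don't show up as spurious "changes"
    return [re.sub(r'\s+', ' ', p).strip() for p in paragraphs]

print(normalize(["The  pyramids are   in Egypt. "]))
# ['The pyramids are in Egypt.']
```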

And that's it for this article. Thank you for going through it. Cheers :)
