
Introduction to NLP - Getting Started

Background

With artificial intelligence entering every field, a key requirement for training machines is "data". A major portion of the data generated today, from social media and mediums like virtual assistants, blogs, news, videos, audio, images, and all sorts of papers (research, white), is unstructured. As per current estimates, Google handles around 8.5 billion searches per day and approximately 2 trillion searches per year; similarly, Bing handles around 27 billion web searches per month, or roughly 37.5 million web searches per hour. But, if we go by industry estimates, less than 25% of today's data is available in structured or tabular form.

So, the question arises: when data is available in human languages, how can we use it to train machines and build accurate AI systems? For one of these problems, i.e. textual data, the answer is "Natural Language Processing".

What is Natural Language Processing?

NLP can be defined as a field of artificial intelligence concerned with enabling machines to learn human languages so that they can understand, translate, generate, and converse with humans in their natural language.

It can be thought of as the intersection of artificial intelligence, computer science, and human language.


In simple words, it is the technology used to make machines understand human languages, or natural languages, hence the name Natural Language Processing (NLP).

Common Use Cases

A few of the common use cases where NLP is being used, or can be used, are as below:-
  1. Chatbots and virtual assistants: the most common and easily visible use of NLP is in chatbots and virtual assistants, which can understand human language, both text and speech, and reply appropriately.
  2. Machine Translation: a good use of NLP is in translating one language to another, e.g. English to French, Hindi to English, Bengali to Spanish, in real time, which enables people from various countries to interact seamlessly.
  3. Text Summarization: with the advent of generative AI, summarizing long texts has become a use case heavily used by the masses.
  4. Sentiment Analysis: analyzing customer reviews and categorizing them by sentiment, i.e. positive or negative, can be done easily using NLP.
  5. Spam Filtering: we use NLP daily in our email to filter out spam.
  6. Named Entity Recognition (NER): identifying and extracting named entities such as people, places, and organizations from text.
and many more. Anywhere you see a lot of text, try thinking of a use case for NLP, and let us know by commenting down below; who knows, you might have figured out a new use case for NLP.

 Challenges & Complexities

Now let's try to understand the various challenges and complexities that we face while dealing with Unstructured Textual Data. 

  1. Lack of structure:- A major issue while dealing with text data is that it has no set, defined structure: some of it can be in the form of tweets, some in documents, some in bullet points, etc. Basically, it can differ from source to source.
  2. Complex Syntax and Grammar:- Natural languages have complex syntax and rules (grammar), which can vary from language to language, e.g. English, Hindi, Spanish, French.
  3. Cross-Language Challenges:- Working with multilingual text data introduces additional complexities due to language variations, translation challenges, and different linguistic structures.
  4. Domain-specific knowledge:- With changing domain the meaning of sentences can change and knowing the technical jargon is important for getting the context of the sentences. 
  5. Data Quality and Bias:- Ensuring data quality is challenging, and biases present in training data can be perpetuated in NLP models, potentially leading to biased outputs.
  6. Ambiguity:- a sentence or phrase can be interpreted in multiple ways due to unclear or multiple meanings of its words or phrases. These ambiguities can be further subdivided into the following categories:-
    • Lexical Ambiguity:- arises when a word can have more than one meaning. E.g. 'bat' can be the mammal or the equipment used by players in cricket or baseball.
    • Syntactic Ambiguity:- arises when a sentence can be parsed in multiple ways. E.g. "I saw a man with glasses" can be interpreted as "I saw a man who was wearing glasses" or "I was able to see a man by using glasses".
    • Semantic Ambiguity:- arises when a word can have multiple interpretations based on the context. E.g. "bank" can refer to a financial institution as well as the side of a river. <Note: In some places, lexical and semantic ambiguity are considered to be the same>
    • Referential/Anaphoric Ambiguity:- occurs when it is unclear to what or to whom a pronoun in the sentence refers. E.g. in "He gave the pen to him", it is unclear to whom "He" or "him" refers.
    • Temporal Ambiguity:- occurs when the time frame or sequence of events is not clear. E.g. "We went out post-lunch": it is unclear whether we went out immediately after lunch or at a later time.
    • Quantifier Ambiguity:- It arises when terms like "some," "many," or "few" are used without specifying the exact quantity. For example, "Some students passed the exam" does not indicate how many students passed.
Thus, working with natural languages is a complex task: a slight mistake can lead to training a wrong model and thus wasting our time and resources.

Key terminologies in NLP:- 

Before we dive deeper into further sections, let's understand the key terms in NLP so that future references to them are easy to grasp.
  1. Tokenization: the breaking of sentences into smaller building blocks. These building blocks, also known as 'tokens', can be words or phrases.
    E.g. "I like reading articles on QDS." Upon tokenizing this sentence we will get tokens like
    "I"
    "like"
    "reading"
    "articles"
    "on"
    "QDS"

  2. Tokens: they are the small pieces of text used in NLP. They are the fundamental units of text that are processed and analyzed by the NLP models. They can be words, phrases or N-grams.

  3. Corpus: it is defined as the collection of text documents. E.g. collection of books, research papers, news, datasets containing comments on social media etc. It can be understood as 
    Corpus > Documents > Paragraphs > Sentences > Tokens

  4. N-Grams: N-grams are sequences of words or phrases taken "n" at a time. They help capture the contextual information and relationships between these words or phrases.
    E.g. "I like Quick Data Science"
    Uni-gram:- "I", "like", "Quick", "Data", "Science".
    Bi-gram:- "I like", "like quick", "quick data", "data science"
    Tri-gram:- "I like quick", "like quick data", "quick data science"
    and so on.  
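A simple sketch of generating n-grams from a token list, which reproduces the examples above:

```python
def ngrams(tokens, n):
    """Return all n-grams (joined as strings) from a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I like Quick Data Science".split()
print(ngrams(tokens, 2))
# → ['I like', 'like Quick', 'Quick Data', 'Data Science']
print(ngrams(tokens, 3))
# → ['I like Quick', 'like Quick Data', 'Quick Data Science']
```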

  5. Normalization: it is the process of cleaning and preprocessing the text to make it consistent and usable for NLP. It comprises many strategies like Lemmatization, Stemming, Stopword removal, punctuation removal etc.

  6. Lemmatization: a systematic, step-by-step way of converting or reducing words to their simplest form, usually their base or dictionary form, which is known as the lemma. It helps simplify words so that we don't have to deal with different variations of the same word.
    E.g.
    'running', 'runs', 'ran', 'run' >> 'run'
    'changed', 'changing', 'changes', 'change' >> 'change'
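In practice, lemmatizers rely on a morphological lexicon (e.g. NLTK's WordNetLemmatizer or spaCy). A toy sketch with a hand-written lemma dictionary, which is purely illustrative, shows the idea of dictionary lookup:

```python
# Toy lemma dictionary; a real lemmatizer uses a full morphological lexicon.
LEMMAS = {
    "running": "run", "runs": "run", "ran": "run",
    "changed": "change", "changing": "change", "changes": "change",
}

def lemmatize(word):
    """Look the word up in the lemma dictionary; fall back to the word itself."""
    return LEMMAS.get(word.lower(), word.lower())

print([lemmatize(w) for w in ["running", "ran", "changes", "run"]])
# → ['run', 'run', 'change', 'run']
```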

  7. Stemming: a cruder way of reducing words to their base forms by stripping suffixes from them.
    E.g.
    'Running' >> 'Run'
    'Studying' >> 'Study' 
    'Studies' >> 'Studi'
    Stemming is not always a good approach because at times it can produce words that are not present in the dictionary.
    E.g. 'His teammates are switching sides for winning' upon stemming it will be reduced to 
    'His' > 'hi'
    'teammates' > 'teammate'
    'are' > 'are'
    'switching' > 'switch'
    'sides' > 'side'
    'for' > 'for'
    'winning' > 'winn'
    'Hi teammate are switch side for winn', which does not make any sense and contains words that are not actual words. Thus, stemming should be used with caution.
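The example above can be reproduced with a very naive suffix-stripping stemmer. This is a sketch only; real stemmers such as Porter's apply many more rules and cleanup steps (and would, for instance, reduce 'winning' to 'win' rather than 'winn'):

```python
# Naive stemmer: strip a couple of common suffixes.
SUFFIXES = ["ing", "s"]

def stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    word = word.lower()
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return word[: -len(suf)]
    return word

print([stem(w) for w in "His teammates are switching sides for winning".split()])
# → ['hi', 'teammate', 'are', 'switch', 'side', 'for', 'winn']
```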

  8. Stopwords: commonly used words in a language, like "a", "the", "are", that are present everywhere but contribute very little or nothing to the actual meaning of a sentence.
    E.g. "The quick brown fox jumps over the lazy dog." here "the" is the stopword.
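Stopword removal can be sketched as a simple filter against a word list (the tiny list below is illustrative; libraries like NLTK ship much fuller stopword lists per language):

```python
# Tiny illustrative stopword list.
STOPWORDS = {"a", "an", "the", "is", "are", "over"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "The quick brown fox jumps over the lazy dog".split()
print(remove_stopwords(tokens))
# → ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```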

  9. Bag-of-Words (BoW): a representation of text as an unordered collection of words, mapping each word (key) to the number of times it occurs (value).
    E.g. "I like Quick Data Science"
    BoW - {"I":1, "like": 1, "Quick":1, "Data":1, "Science":1}
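The BoW mapping above can be built directly with Python's `collections.Counter`:

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring word order."""
    return dict(Counter(text.split()))

print(bag_of_words("I like Quick Data Science"))
# → {'I': 1, 'like': 1, 'Quick': 1, 'Data': 1, 'Science': 1}
```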

  10. Part-of-Speech (POS) Tagging: labelling each word with its grammatical role; that is, assigning grammatical categories like noun, verb, and adjective to each word in the sentence.
    E.g "I like Quick Data Science"
    POS -- "I" - Pronoun, "like" - Verb, "Quick" - Adjective, "Data" - Noun, "Science" - Noun

  11. Word Embedding: A word embedding serves as a representation of a word, utilized in text analysis. It usually takes the form of a real-valued vector, encoding the essence of the word so that words with similar meanings are positioned closer to each other in the vector space.
    E.g. if we generate word embeddings for "Quick Data Science" and plot them, words with meanings similar to the original words appear close to them in the vector space, while unrelated words appear farther away.
    [Figure: Word Embeddings]
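This "closeness" is typically measured with cosine similarity. A small sketch with made-up 3-dimensional vectors (real embeddings are learned from data and have hundreds of dimensions; the vectors below are purely illustrative):

```python
import math

# Made-up 3-d vectors for illustration only.
embeddings = {
    "quick":  [0.9, 0.1, 0.2],
    "fast":   [0.8, 0.2, 0.1],
    "banana": [0.1, 0.9, 0.7],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# "quick" should be more similar to "fast" than to "banana".
print(cosine(embeddings["quick"], embeddings["fast"]) >
      cosine(embeddings["quick"], embeddings["banana"]))
# → True
```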

  12. Named Entity Recognition (NER): it is like training a model to look for specific types of information like names, locations, organizations, and dates within text. 
    E.g. "QDS is based out of Varanasi, India"
    here the model will recognize "QDS" as an organization and "Varanasi, India" as a location.

Summary

We have learnt the basics of NLP, how NLP can be used in our daily lives, the challenges faced while working with it, and the key terminologies used in NLP. In the next section, we will set up our workstations for NLP and see what libraries Python offers for NLP.

 





