
Posts

Showing posts with the label Data Science

Digital Twins -- An aid in medical treatments

The first question that comes to mind is: what is a Digital Twin, and how is it related to the Data Science and medicine topics we study here? Don't worry, that is exactly where we will start. What is a Digital Twin? As the name suggests, a Digital Twin is a twin, or copy, of something, with a digital or virtual presence. So the next question is: what is this "something" of which we create a copy? This "something" can be anything (a process, plant, factory, human, machine, etc.) that is not well defined and might require multiple rounds of mix-and-match setups to reach the desired or maximum efficiency. These repeated steps not only take a lot of time but also involve a huge amount of money. On top of that, there is uncertainty about whether the process will succeed at all and how much more time will be needed to reach the final version. Let's take a small example to understand the whole concept of the Digital Twin.

40,000 Weapons in Just 6 Hours

You might have been shocked by reading the title and might be wondering what we are talking about. Wait, this is just the start; there is more to come. With the increasing use of computers and growing technologies, machine learning and artificial intelligence are gaining popularity and have become some of the most trending jobs of recent years. But, as we know, everything has a positive and a negative side. The same is the case here: a machine learning model takes an enormous amount of effort to design and train so that it works for the betterment of human society, but it takes far less effort to turn the machine against us or make it produce wrong judgements. A Brief Background: a machine learning model built to find new therapeutic inhibitors of targets for human diseases guides a molecule generator, "MegaSyn". A public database was used to train the AI model, which was created after inverting the basic principle of MegaSyn, i.e. the machine learning model which w

Feature Scaling -- Scaling to Unit Length

Let's look at a more technical feature scaling method that we can use for scaling our dataset. It is popularly known as "Scaling to Unit Length", as all the features in a row are scaled down by a common value. Unlike the previous methods we have studied so far, which scale each feature using a value specific to that variable, here all the variables together are used to scale the features. The scaling is done row-wise so that each complete feature vector has a length of 1, i.e. the procedure normalises whole rows (one vector per observation) rather than individual columns. Note:- Scikit-learn recommends this scaling procedure for text classification or clustering. Formula Used:- Scaling to Unit Length can be done in 2 different ways:- 1. Using the L1 Norm:- The L1 norm, popularly known as the Manhattan distance, can be used to scale the dataset: each value is divided by l1(x), the sum of the absolute values of the feature vector. 2. Using the L2 Norm:- The L2
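To make the two variants concrete, here is a minimal sketch using scikit-learn's Normalizer on a small made-up array; the numbers are purely illustrative and are not taken from the post.

```python
# Minimal sketch of scaling to unit length; the toy array is invented for illustration.
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[1.0, -2.0, 2.0],
              [4.0,  0.0, 3.0]])

# L1 variant: each row is divided by its Manhattan length (sum of absolute values),
# so the absolute values of every row sum to 1.
X_l1 = Normalizer(norm='l1').fit_transform(X)

# L2 variant: each row is divided by its Euclidean length,
# so every row vector ends up with length 1.
X_l2 = Normalizer(norm='l2').fit_transform(X)

print(np.abs(X_l1).sum(axis=1))       # -> [1. 1.]
print(np.linalg.norm(X_l2, axis=1))   # -> [1. 1.]
```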

Feature Scaling -- Robust Scaling

Another technique in feature scaling is Robust Scaling, also known as scaling to quantiles and the median. Robust scaling uses the median and the inter-quantile range for scaling the values of our dataset. Quantiles can be defined as the cut points that divide the range of a probability distribution into continuous intervals with equal probabilities, e.g. the 25th, 50th and 75th quantiles. The inter-quantile range is the difference between the upper and lower quantiles. The median is the middle value of a series when the values are arranged in ascending or descending order. The logic here is to subtract the median from each value, which brings the overall median down to 0, and then divide the difference by the gap between the 75th and 25th quantiles. Formula Used:- x_scaled = (x - median(x)) / (75th quantile of x - 25th quantile of x). Features of Robust Scaling:- 1. The median is centred at 0:- Since the median is subtracted from each value individually to scale the dataset, the median is reduced to, and centred at, 0 for each variable.
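As a quick illustration, here is a sketch of robust scaling with scikit-learn's RobustScaler on a made-up column containing an outlier, alongside the same computation done by hand with the formula above.

```python
# Sketch of robust scaling; the toy column is invented for illustration.
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # 100 is an outlier

# RobustScaler subtracts the median and divides by the inter-quantile range
# (75th quantile - 25th quantile) of each column.
scaled = RobustScaler().fit_transform(x)

# The same result by hand, following the formula in the post.
median = np.median(x)
q75, q25 = np.percentile(x, [75, 25])
manual = (x - median) / (q75 - q25)

print(np.allclose(scaled, manual))   # -> True
```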

Data Visualization — IPL Data Set (Part 1)

Welcome to the 2nd post in the Data Visualization series, on one of the most loved and followed topics in India, the IPL (Indian Premier League) (Part 1). In this part, we will focus on various analyses based on the teams. Overview of the Data Set: description of the columns of the IPL dataset. Let's begin by checking the data in these columns: IPL Data Set 1 overview and IPL Data Set 2 overview. Moving on to the most interesting part, visualising the dataset and its relations. Let's begin by having a look at the total wins by each team since 2008. Team vs. number of match wins: this bar chart shows the top 5 teams with the most match wins across all the seasons. Surprisingly, RCB, despite being among the top 5, still hasn't won any IPL season. Now let's have a closer look at these wins. Team vs. wins based on runs/wickets: this bar chart is a detailed version of the previous graph; it shows the top 5 teams with the most wins, split into wins achieved batting first and batting second. Blue bars represent the wins
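For readers who want to reproduce the first chart, here is a rough sketch with pandas and matplotlib; the file name matches.csv and the winner column are assumptions about the public IPL matches dataset, so adjust them to the actual columns.

```python
# Rough sketch of the "team vs. match wins" bar chart; the file name and the
# 'winner' column are assumptions about the IPL matches CSV, not guaranteed.
import pandas as pd
import matplotlib.pyplot as plt

matches = pd.read_csv('matches.csv')            # hypothetical path to the dataset

# Count the wins per team and keep the five teams with the most wins.
top5 = matches['winner'].value_counts().head(5)

top5.plot(kind='bar', color='steelblue')
plt.title('Top 5 Teams by Match Wins (all seasons)')
plt.xlabel('Team')
plt.ylabel('Number of match wins')
plt.tight_layout()
plt.show()
```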

Partitioning in HIVE - Learning by Doing

< Previous: Partitioning in Hive. We studied the theory behind partitioning in Hive in our previous article; time to get our hands dirty now. We will follow the pattern below for the coding part:- 1. Hadoop installation. 2. Hive installation. 3. Static partitioning (the theory is covered in the previous article). 4. Dynamic partitioning (the theory is covered in the previous article). Hopefully we have installed Hadoop and Hive and have both up and running; a sketch of the partitioning statements is shown below.
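As a taste of the hands-on part, here is a minimal sketch of the static and dynamic partitioning statements issued from Python through PyHive; the host, port, table names and file path are assumptions rather than the article's exact setup.

```python
# Sketch only: assumes a local HiveServer2 on the default port and hypothetical
# 'sales' / 'sales_staging' tables; the article walks through its own setup.
from pyhive import hive

cur = hive.Connection(host='localhost', port=10000).cursor()

# A table partitioned by country: each country value gets its own HDFS sub-directory.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
    PARTITIONED BY (country STRING)
""")

# Static partitioning: the partition value is spelled out in the statement itself.
cur.execute("""
    LOAD DATA LOCAL INPATH '/tmp/sales_in.csv'
    INTO TABLE sales PARTITION (country = 'IN')
""")

# Dynamic partitioning: Hive derives the partition value from the data being inserted.
cur.execute("SET hive.exec.dynamic.partition.mode = nonstrict")
cur.execute("""
    INSERT INTO TABLE sales PARTITION (country)
    SELECT id, amount, country FROM sales_staging
""")
```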

Partitioning in Hive

What is partitioning? In simple words, we can describe partitioning as the process of dividing something into sections or parts, with the motive of making it easier to understand and manage. In our everyday routine we also use this concept to ease our tasks and save time, but we do it so naturally that we hardly notice how we did it. Let's see an example and get familiar with the concept. Suppose we have a deck of cards and need to fetch the "Jack of Spades" from it. There are two ways in which we can accomplish this task. We can turn over every card one by one, starting from the top or the bottom, until we reach our card. Or we can group the deck by suit, i.e. clubs, hearts, spades and diamonds; now, as soon as we hear "Spades", we know which group to look in, cutting our work down to a quarter. This grouping of our data according to some specific category reduces our work and saves energy, time and effort. Defining in Technical Terms
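As a tiny illustration of the card-deck analogy, here is a toy sketch in pandas where grouping by suit plays the role of the partition key; the deck DataFrame is made up for the example.

```python
# Toy illustration of the card-deck analogy: grouping the deck by suit means a
# lookup for the Jack of Spades only scans a quarter of the cards.
import pandas as pd

suits = ['Clubs', 'Hearts', 'Spades', 'Diamonds']
ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
deck = pd.DataFrame([(s, r) for s in suits for r in ranks], columns=['suit', 'rank'])

# "Partition" the deck: one group per suit, like one directory per partition key.
partitions = {suit: group for suit, group in deck.groupby('suit')}

# Fetching the Jack of Spades now searches 13 cards instead of 52.
spades = partitions['Spades']
print(spades[spades['rank'] == 'J'])
```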

Getting Acquainted with NewSQL

Introduction: NewSQL is a relatively new database management system. It is so nascent that it is still not listed as a proper DBMS, because its rules and conventions are still taking shape. To understand NewSQL, we need to know why the need arose for a new database when we already had two great and successful database families (SQL and NoSQL). SQL is the most widely used and most preferred database of all time; its ACID properties make it one of a kind and rank it above the others. NoSQL is another rising database that has recently gained the limelight due to the rise of big data technologies and the need to store enormous volumes of documents/data coming from different sources. We can read more about the difference between SQL vs NoSQL.

ExploriPy -- Newer ways to Exploratory Data Analysis

Introduction: ExploriPy is yet another Python library for exploratory data analysis. This library caught our attention because it is quick and easy to implement and its basics are simple to grasp. Moreover, the visuals it provides are self-explanatory and easy for any new user to understand. The most interesting part, which we can't resist mentioning, is the easy grouping of the variables into different sections, which makes it more straightforward to understand and analyse our data. The four major sections presented are:- Null Values, Categorical VS Target, Continuous VS Target, and Continuous VS Continuous.
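For orientation, here is a hedged usage sketch; the EDA class, the CategoricalFeatures argument and the TargetAnalysis method are written from memory of the project's README, and the dataframe and column names are invented, so treat the exact signatures as assumptions and confirm them against the library's documentation.

```python
# Hedged sketch of ExploriPy usage; the dataset and column names are invented,
# and the exact API (EDA class, TargetAnalysis method) should be verified
# against the library's own documentation.
import pandas as pd
from ExploriPy import EDA

df = pd.read_csv('train.csv')                 # hypothetical dataset
categorical = ['gender', 'city']              # hypothetical categorical columns

eda = EDA(df, CategoricalFeatures=categorical, title='EDA Report')
eda.TargetAnalysis('target')                  # assumed entry point: builds an HTML report
                                              # grouped into the four sections listed above
```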

The Explorer of Data Sets -- Dora

Exploring a dataset is both fun and tedious, but it is an inevitable step of the machine learning journey. The challenge always lies in the correctness, completeness and timeliness of the analysis of the data. To overcome these issues a lot of libraries are available, each with its own advantages and disadvantages. We have already discussed a few of them (Pandas profiling, dtale, autoviz, lux, sweetviz) in previous articles. Today, we would like to present a new library for exploratory data analysis: Dora. Calling it only an EDA library would not do it justice, as it not only helps explore the dataset but also helps adjust the data for modelling purposes.
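To give a flavour of the workflow, here is a hedged sketch of how Dora is typically driven; the CSV path and column names are invented and the method names follow our reading of the project's README, so verify them before relying on them.

```python
# Hedged sketch of a typical Dora workflow; the file path and column names are
# invented and the method names follow the project's README as we recall it.
import pandas as pd
from Dora import Dora

df = pd.read_csv('housing.csv')          # hypothetical dataset
dora = Dora(output='price', data=df)     # 'price' is the assumed target column

# Cleaning: fill in missing values and scale the input columns.
dora.impute_missing_values()
dora.scale_input_columns()

# Light feature engineering and a quick look at a single feature.
dora.extract_ordinal_feature('neighbourhood')
dora.plot_feature('area')

# Split the data for modelling; the splits live on the Dora object afterwards.
dora.set_training_and_validation()
print(dora.training_data.head())
```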