
Co-occurrence Matrix

This article gives a brief introduction to the co-occurrence matrix and its implementation in Python.

Given a document made up of a set of sentences, the co-occurrence matrix is a matrix representation of that document. The core idea of the co-occurrence matrix is to check whether a particular word appears in the context of a focus word.

Let us take an example to understand this better. Let us consider a document containing two sentences S1 and S2 as shown in Figure 1.

There are three parts to creating a co-occurrence matrix. They are:

  1. Matrix of unique words
  2. Focus word
  3. Window length

Let us create a matrix of all the unique words in the document as shown in Figure 2. All the values in the table are initialized to 0.
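As a rough sketch of this step, the snippet below collects the unique words and builds a zero-filled matrix. The two sentences are hypothetical stand-ins, since the article's actual S1 and S2 are only given in Figure 1, and the names `sentences`, `vocab`, `index` and `matrix` are illustrative choices, not taken from the article.

```python
# Hypothetical two-sentence document (assumption: the real S1 and S2 are in Figure 1).
sentences = [
    "Kalidasa was a poet",
    "Kalidasa wrote a play",
]

# Split each sentence on whitespace to get its words.
tokenized = [sentence.split() for sentence in sentences]

# Collect the unique words, keeping the order in which they first appear.
vocab = []
for tokens in tokenized:
    for word in tokens:
        if word not in vocab:
            vocab.append(word)

# Map each unique word to its row/column position in the matrix.
index = {word: i for i, word in enumerate(vocab)}

# One row and one column per unique word, every cell initialized to 0.
matrix = [[0] * len(vocab) for _ in vocab]

print(vocab)
for row in matrix:
    print(row)
```

Each row will later stand for a focus word and each column for a context word.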

Figure 2

Once the matrix is created, we scan through each word of each sentence in the document, treating it in turn as the focus word. We also choose a window length: the number of words around the focus word that we consider. The words that fall inside this window are the context words.

Our objective is to count, for each focus word in the document, how many times each context word appears within the given window length. The process is illustrated in Figure 3.
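To make the idea of a window concrete, here is a minimal sketch of picking out the context words for one focus word. It assumes the window extends the same number of words on both sides of the focus word and includes the focus word itself. The function name `context_words` and the sample sentence are hypothetical.

```python
def context_words(tokens, position, window):
    """Return the words within `window` positions of tokens[position].

    The focus word itself is part of the returned slice.
    """
    start = max(0, position - window)                # do not run past the start
    end = min(len(tokens), position + window + 1)    # do not run past the end
    return tokens[start:end]


# Hypothetical sentence, used only for illustration.
tokens = "Kalidasa was a poet".split()

# Focus word 'Kalidasa' (position 0) with a window length of 2.
print(context_words(tokens, 0, 2))  # ['Kalidasa', 'was', 'a']
```

With a window length of 2, the words ‘was’ and ‘a’ fall in the context of ‘Kalidasa’, along with ‘Kalidasa’ itself.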

Figure 3

Since the words ‘was’ and ‘a’ appear in the context of the word ‘Kalidasa’, the cells in the columns ‘was’ and ‘a’ of the row for the focus word ‘Kalidasa’ in the table from Figure 2 are incremented by 1, as shown in Figure 4.

Figure 4

Since every focus word appears in its own context, all the diagonal elements are incremented as well. Hence, the column ‘Kalidasa’ is also incremented. The process then continues with the next focus word, as shown in Figure 5.

Figure 5

Repeating this process for all the words in S1 and S2 gives the matrix shown in Figure 6.

Figure 6

In this way, the co-occurrence matrix can be created and later used for analysis.
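The whole procedure can be sketched in a few lines of Python. This is one possible implementation, not the only one. It assumes whitespace tokenization, a window that is symmetric around the focus word, and a focus word that counts as part of its own context (which is what increments the diagonal). The sentences are again hypothetical stand-ins for S1 and S2, and `cooccurrence_matrix` is an illustrative name.

```python
def cooccurrence_matrix(sentences, window=2):
    """Build a word-by-word co-occurrence matrix from a list of sentences.

    Rows are focus words and columns are context words.
    """
    tokenized = [sentence.split() for sentence in sentences]

    # Unique words in order of first appearance.
    vocab = []
    for tokens in tokenized:
        for word in tokens:
            if word not in vocab:
                vocab.append(word)
    index = {word: i for i, word in enumerate(vocab)}

    # Start from a matrix of zeros.
    matrix = [[0] * len(vocab) for _ in vocab]

    # Scan every word of every sentence as the focus word.
    for tokens in tokenized:
        for position, focus in enumerate(tokens):
            start = max(0, position - window)
            end = min(len(tokens), position + window + 1)
            # The slice includes the focus word, so the diagonal is incremented too.
            for context in tokens[start:end]:
                matrix[index[focus]][index[context]] += 1

    return vocab, matrix


# Hypothetical document (assumption: the article's S1 and S2 are in Figure 1).
sentences = ["Kalidasa was a poet", "Kalidasa wrote a play"]
vocab, matrix = cooccurrence_matrix(sentences, window=2)

print(vocab)
for word, row in zip(vocab, matrix):
    print(word, row)
```

Because the window is symmetric, the resulting matrix is symmetric too: if ‘a’ is in the context of ‘Kalidasa’, then ‘Kalidasa’ is in the context of ‘a’.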

The co-occurrence matrix built above includes stopwords. These unnecessarily increase the size of the matrix and the computational cost. Both can be reduced by:

  1. Removing the stopwords in the document
  2. Considering only the important words in the co-occurrence matrix, selected using a TF-IDF vectorizer
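The first option can be sketched as follows. The stopword set here is a tiny hand-written, hypothetical list used only for illustration. In practice a fuller list from an NLP library would be used. The sentences are the same hypothetical stand-ins for S1 and S2.

```python
# Tiny hand-written stopword list (assumption, for illustration only).
STOPWORDS = {"was", "a", "an", "the", "is", "of"}


def remove_stopwords(sentence, stopwords=STOPWORDS):
    """Split a sentence on whitespace and drop any word found in `stopwords`."""
    return [word for word in sentence.split() if word.lower() not in stopwords]


# Hypothetical document.
sentences = ["Kalidasa was a poet", "Kalidasa wrote a play"]
filtered = [remove_stopwords(sentence) for sentence in sentences]

# Compare vocabulary sizes before and after filtering.
vocab_before = {word for sentence in sentences for word in sentence.split()}
vocab_after = {word for tokens in filtered for word in tokens}

print(filtered)
print(len(vocab_before), "unique words before,", len(vocab_after), "after")
```

For this toy document the vocabulary shrinks from 6 words to 4, so the matrix shrinks from 36 cells to 16.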
