Models of Computation, Document Distance

Rabin Gaire
9 min readMar 20, 2018

Hello, This is a 47 part series that tries to give an introduction to algorithms. The content that I am using here to write this series is from MIT 6.006 Introduction to Algorithms, Fall 2011. In this second part of the series, we are going to talk about the Models of Computation and an Algorithm called Document Distance.

Before talking about anything else let’s first try to define Algorithm.

What is Algorithm?

According to the professor of MIT 6.006 Introduction to Algorithms, Erik Demaine Algorithm is a

  • Computational Procedure to solve a problem
  • Algorithm is a mathematical analog (mathematical abstraction) of a computer program

Here the picture above describes that the algorithm is a mathematical analog of a computer program, A computer program is built on top of a programming language whereas an algorithm is written in pseudocode (pseudocode is just a fancy word for structured English or any other natural language). A computer program written on any programming language runs on a computer, and the analog of a computer on a mathematical world is called model of computation. Model of computation says what your computer can do, what it allows to do in constant time.

Model of Computation

Model of computation specifies

  • What operation an algorithm is allowed
  • How much does it cost (What is complexity?)(time, space, …)
  • Cost of algorithm = sum of operation cost

Let’s talk about two models of computation

  • Random Access Machine ( also known as RAM not Random Access Memory, this is not a typing mistake)
  • Pointer Machine

Random Access Machine

Random Access Machine falls on the side of a mathematical model of computation, whereas Random Access Memory falls on the side of a computer device. These two things are quite similar but we can’t call same. Anyway, we will use Random Access Memory to define Random Access Machine which is more of a mathematical analog of an actual Random Access Memory on a real computer.

Talking about Random Access Memory, it’s basically a large array. Random Access Memory is usually organized by word.

In order to compute with it we can say in Θ(1) (Constant Time) an algorithm can

  • Load Θ(1) constant amount of word from memory
  • Do Θ(1) amount of computation (+, -, *, /, ^, &, |)
  • Store Θ(1) words

This Random Access Machine is how we think about computers, it’s an accurate model of computation that loading, computing and storing all takes about Θ(1) constant amount of time. We can manipulate a whole word in a constant amount of time.

We introduced word a while ago, what does word mean

  • We know computer these are are of 32 bits and 64 bits, but being a little bit more abstract a word is w bits.

Let’s not worry much about word, let’s just assume it a value with w bits, that our computational model accepts, word comes in as inputs and we compute on them.

So, Random Access Machine is a realistic model and it’s also a powerful model, but a lot of code doesn’t use Arrays and we might not need it, So we must think about more abstract models that are not as powerful but offer a simpler way of thinking about things. For example Random Access Model don’t have dynamic memory allocation, we know we can allocate memory dynamically, we have seen them implemented on our computers. But it’s nice to think about some kind of model that already takes cares of things like dynamic memory allocation, It’s like a higher level programming abstraction. We have one model that basically helps us get around easily with dynamic memory allocation called Pointer Machine.

Pointer Machine

Pointer Machine corresponds to Object-Oriented Programming in a simpler version. So we have

  • Dynamically allocated objects
  • Objects have constant amount Θ(1) of fields
  • Field = Word (e.g. int) or a pointer (This is where pointer machine gets its name). What is pointer? A pointer is something that points to another object or has special value null.

The image above shows us a linked list, linked list have a bunch of field in each node, we might have a pointer to the previous element, a pointer to next element and some value. To be precise the image shows a type of linked list called Doubly Linked List we call it Doubly because each node has two pointers that point to the previous node and next node. Pointer prev of the initial node and the pointer next of the last node points to a null value. We actually have a two node doubly linked list in the picture.

So this is a structure in a pointer machine, it’s a type of a data structure, in python, we might call this as a tuple or an object with three attributes and the value in this structure can be called as a word. We can implement this model in Random Access Machine and a pointer becomes an index of a giant table which is basically an array in Random Access Machine, Actually this is how a pointer in C programming language has been implemented.

We will talk a lot more about link list in upcoming blogs but for now everything we do in linked list takes constant amount of time, following a pointer cost constant amount of time, adding a node cost constant amount of time, changing any value on a node cost constant amount of time, basically all the usual things that we might do to the objects on a node cost constant amount of time. Due to this factor, it is considered Linked List an efficient and important data structure in computer science.

So pointer machine is a weaker model than random access machine as we can implement pointer machine with random access machine but pointer machine allows us to think differently and lots of data structure are build using this concept.

This was all about theory, so let’s talk about a practical approach using python model.

Python Model

Let’s talk why do we use python model, it’s because both of the models we talked earlier (Random Access Machine, Pointer Machine) are old model of computation, and python is a recent programming language and also python implements somewhat amount of older model of computational perspective. Python has arrays, as we have already discussed arrays resemble Random Access Machine perspective and also it has pointers which resemble Pointer Machine perspective. Older models of computation (Random Access Machine, Pointer Machine) are simple and theoretical so whatever we do on them only takes a constant amount of time, but in python model, some operation takes a lot of time while other still takes a constant amount of time. So whenever we are thinking about an algorithm in python model or implementing in Python programming language, we must know which operation is faster and which operation is slower.

Let’s spend some of our time trying to figure out which operation of python is faster and slower.

Let’s take an operation L[i] = L[j] + 5 where L is the list, In python this operation will always take Θ(1) time (This example resemble Random Access Machine). In other programming languages, this might take linear time as it has to scan list L to j add 5 and place the result by scanning the list L again to i.

Objects (Resemble pointer machine model) with O(1) attributes. x = x.next here x being the pointer in a linked list, when there is an operation like this it also takes Θ(1) time in python model.

It’s been a while I am talking about constant time Θ(1), we don’t care much about this constant time because what we only care about is scalability with n and there is no n in these case.

List (Arrays are called list in Python)

L.append(x) (Here L being a list) To understand how much does it cost, we should think by how it’s implemented in terms of the basic operation (L[i] = L[j] + 5,Objects x = x.next) . Most of the things we do can be reduced to those two terms. So let’s get back to append. Basically thinking about how does an append work, we have a certain size of an array and want to append an element to make it one element larger, how do we do it? The obvious way is to create a new array with one more larger size than of the original one and copy all the contents over to the new array and add the new element at the last. If you have written C program before then this is how malloc/realloc works. This approach is fine but it takes linear time. In case of python it does not do in this way, we will talk in detail how this works in a later blog post, but for now, let’s say it just do something called table doubling which cost around constant time Θ(1) most of the time.

L = L1 + L2 (L, L1, L2 are lists)

append takes Θ(1) time

Θ(|L1|) order of length of L1

Θ(|L2|) order of length of L2

At last this operation takes Θ(1+ |L1| + |L2|), where did 1 come from? If a length of L1 and L2 is zero it still takes constant time to build the empty array.

Most of the time people get confused and say operation L = L1 + L2 takes constant time because they see + on the operation, It takes constant time only if L1 and L2 are simple variables but in our case they are not rather they are entire data structure called list(array), so while analyzing the algorithm we must take account the data structure in use.

x in L ( L is a list )

This operation takes linear time in worst case scenario as we have to scan through the items in the list to find the value.

len(L) (L is a list)

This operation takes Θ(1) time, How does this operation in python take constant time? Because in python list are implemented with a counter built in, which makes it easier to find the length of a list. In other programming languages, this operation might take linear time as it has to go through the entire list to count the length of the list, to be fair this is all related to how an operation is implemented in a programming language.

I am not going to talk more about this anymore as it is obvious how to find the cost of an operation if you want more information about how much does it cost to perform a certain operation in python please refer to python documentation.

Before dropping this topic I have a quiz question, answer this in the comment box.

Q. Let’s assume we have a two long integer x and y, how much does it cost to add them both?

Document Distance Problem

In document distance problem we have two document D1 and D2 and we want to compute the distance between them represented as:

d(D1, D2)

So the big question is what does distance mean?

Distance here means finding out whether two document are similar or not/Duplicate or not

Document = Sequence of words

Word = Sequence of alphanumeric characters

So the idea in this problem is to find the shared word between documents and use that to define distance (Document distance).

We can also think document as a vector so

D[W] = # Occurance of word W in document D

For example

Let’s take two documents

D1 = “the cat”

D2 = “the dog”

We can represent them in 3D space because we just have three different words here in our document D1 and D2

D1 and D2 in the above figure are vectors.

So at first attempt, we can define document distance using an inner product of vector calculus which is also called dot product which can be used to find the difference between vectors. ( Math Math Math )

d’(D1, D2) = D1.D2 = wD1[W].D2[W]

But we have a problem because a long document with 99% same word is said to be farther than the short document with 10% same word. To fix this issue we have to divide the above equation with the length of the vectors which turn out to be like this

d’’(D1, D2) = (D1.D2)/(|D1|.|D2|)

Well, this was quite a long blog. Hope you got what I meant in this blog. If you want the reference from where I took content to write this blog then the reference has been listed below

--

--