Data Mining Java - Implementation of PageRank Algorithm

1. Pre-knowledge of PageRank algorithm

PageRank algorithm: Calculate the PageRank value of each web page, and then sort the importance of the web pages according to the size of this value.

From the user's point of view, a website is a collection of several pages. However, for the designer of the website, these pages are carefully organized and are connected in series through the links of the pages as a whole. Therefore, Web structure mining is mainly to discover the page link structure in the website. For example: when designing services such as search engines, mining the link structure of Web pages can obtain useful knowledge to improve retrieval efficiency and quality.
In general, links to Web pages are similar to academic citations, so an important page may have links to many pages pointing to it. That is, if there are many links pointing to a page, it must be an important page. Likewise, a page has significant value if it links to many pages.

Suppose u is a Web page, Bu is the set of all pages pointing to u, Fu is the set of all pages pointing to u, c (<1) is a normalization factor, then the rank R(u) of u page is given by Defined as: R(u)=c∑_(v∈Bu)▒(R(v))/(|Fv|), obviously, the basic page classification method mainly considers the in-degree of a page, that is, by entering the The page rank of the page is obtained. At the same time, when the rank value of a page is passed, it is passed to all the pages it points to using the average distribution method, that is, each page linked from it divides its rank value equally.

The core part of the PageRank algorithm can start with a directed graph. The most typical method is to construct an adjacency matrix according to the directed graph for processing. The element ai,j(∈[0,1]) in the adjacency matrix A=(ai,j) represents the probability of pointing from page j to page i.

Two, the basic idea of ​​the PageRank algorithm

When the basic PageRank algorithm calculates the rank value, each page distributes its own rank value evenly to the page nodes it references. Assuming that the level value of a page is 1, there are n hyperlinks on the page, and the level value assigned to each hyperlink page is 1/n, then it can be understood that the page jumps to on any page it refers to.
Generally, the adjacency matrix A is converted into the so-called transition probability matrix M to implement the PageRank algorithm: M=(1-d)Q+dA, where Q is a constant matrix, the most commonly used is Q=(qi,j) , qi,j=1/n, the transition probability matrix M can be used as a vector transformation matrix to help complete the iterative calculation of the page level value vector R: Ri+1=M*R

3. Example of PageRank algorithm

PageRank algorithm example



Fourth, the implementation process of the PageRank algorithm

Experimental content
The links of web pages in the Internet can be regarded as directed graphs, such as four web pages A, B, C, and D. Please use the PageRank algorithm to rank and sort the web pages

Experiment ideas
(1) Define the Score score class, which includes the numerator son and denominator mom attributes in the Score score class, and also includes the static method simplify() for reducing the numerator and denominator, and the static method getAdd() for adding and simplifying scores As a result, the static method getMultiply() is used to multiply fractions and call simplify() to simplify the result.
(2) Define the initial data set dataList, define each Score class object of the transition matrix, and call the InitData() method to initialize the data set.
(3) Define the Score class object v0Score and instantiate it. The v0Score object represents a score value of 1/4, which means that the probability that people click on the four web pages A, B, C, and D is the same 1/4. Define a one-dimensional array V0 to store 4 v0Score s, and the initial PageRank is also assigned to V0.
(4) Enter the while loop and call the getPageRank() method to obtain the PageRank matrix. Define the pageRankList collection in the body of the getPageRank() method to store the final PageRank value, traverse each item of data dataItem in the dataList data set, and use the for loop to traverse the dataItem, so that the dataItem and Vk complete the matrix multiplication, and the result of each item after the matrix multiplication Save it in itemSum. After traversing dataItm, add itemSum to the pageRankList collection. After traversing the data set dataList collection, convert pageRankList into an array and return.
(5) Call the isRankEqual() method to judge whether the newly obtained PageRank matrix value is the same as the PageRank matrix value obtained last time. If not, continue to iterate, and if they are the same, stop iterating. The isRankEqual() method is mainly to judge whether the values ​​in the PageRank matrix of formal parameter V1 and formal parameter V2 are equal, and return a value of boolean type. The method body traverses the array V1 and compares the numerator of each item in V1 with each element in V2. Whether the numerators of an item are the same, if there are different, return false, if there is no difference after the traversal, return true.
(6) If the newly obtained PageRank matrix value is not the same as the PageRank matrix value obtained last time, assign the newly obtained PageRank matrix value to the final output pageRank variable, and output the PageRank value obtained each time.

Realize the source code

Score kind
package com.data.mining.entity;

import lombok.Data;

@Data
public class Score {
    private int son;
    private int mom;

    public Score(){}

    public Score(int s, int m){
        son = s;
        mom = m;
    }

    /**
     * Add the scores and call the simplify method for reduction
     * @param s1
     * @param s2
     * @return
     */
    public static Score getAdd(Score s1, Score s2){
        if (s1.getMom() == 0 || s2.getMom() == 0) return s1.getMom() == 0 ? s2 : s1;
        int commonMom = s1.getMom() * s2.getMom();
        int commonSon = s1.getSon() * s2.getMom() + s2.getSon() * s1.getMom();
        Score addResult = simplify(commonSon, commonMom);
        return addResult;
    }

    /**
     * Multiply the fractions and call the simplify method for reduction
     * @param s1
     * @param s2
     * @return
     */
    public static Score getMultiply(Score s1, Score s2){
        int tempMom = s1.getMom() * s2.getMom();
        int tempSon = s1.getSon() * s2.getSon();
        Score simplifyResult = simplify(tempSon, tempMom);
        return simplifyResult;
    }

    /**
     * Divide the numerator and denominator
     * @param s
     * @param m
     * @return
     */
    private static Score simplify(int s, int m){
        int common = getCommon(s, m);
        s = s / common;
        m = m / common;
        Score result = new Score(s, m);
        return result;
    }

    /**
     * Find the greatest common divisor of the numerator and denominator
     * @param s
     * @param m
     * @return
     */
    private static int getCommon(int s, int m){
        for (int i = s; i >= 1; i--) {
            if (m%i==0 && s%i==0){
                return i;
            }
        }
        return 1;
    }
}

PageRank Algorithm implementation code
package com.data.mining.main;

import com.data.mining.entity.Score;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PageRank {
    //Define the transition matrix
    public static List<Score[]> dataList = new ArrayList<>();
    //Define each item in the transition matrix
    public static Score s00,s01,s02,s03,s10,s11,s12,s13,s20,s21,s22,s23,s30,s31,s32,s33;

    public static void main(String[] args) {
        initData();
        Score voScore = new Score(1, 4);
        Score[] V0 = {voScore, voScore, voScore, voScore};
        Score[] pageRank = V0;
        while (true){
            Score[] tmpVk = getPageRank(pageRank);
            if (isRankEqual(pageRank, tmpVk)) break; //If the newly obtained PageRank matrix is ​​not the same as the PageRank matrix obtained last time, it will continue to iterate, and if it is the same, it will not iterate
            pageRank = tmpVk;
            System.out.println(Arrays.toString(pageRank));
        }
        System.out.println(Arrays.toString(pageRank));
    }

    /**
     * Determine whether the values ​​in the PageRank matrix of V1 and V2 are equal
     * @param V1
     * @param V2
     * @return
     */
    public static boolean isRankEqual(Score[] V1, Score[] V2){
        for (int i = 0; i < V1.length; i++) {
            int subSon = V1[i].getSon() - V2[i].getSon();
            int subMom = V1[i].getMom() - V2[i].getMom();
            if (subSon != 0 || subMom != 0) return false;
        }
        return true;
    }

    /**
     * Get the PageRank matrix
     * @param Vk
     * @return
     */
    public static Score[] getPageRank(Score[] Vk){
        List<Score> pageRankList = new ArrayList<>();
        for (Score[] dataItem : dataList) {
            Score itemSum = new Score(0,0); //Each item of the PageRank matrix is ​​stored in itemSum
            //Perform matrix multiplication by traversing each row of the dataset and each column of Vk
            for (int i = 0; i < dataItem.length; i++) {
                Score multiply = Score.getMultiply(dataItem[i], Vk[i]);
                itemSum = Score.getAdd(multiply, itemSum); //Accumulate the result of multiplying the corresponding items into itemSum
            }
            pageRankList.add(itemSum);
        }
        return pageRankList.toArray(new Score[pageRankList.size()]);
    }

    /**
     * Initialize the dataset
     */
    public static void initData(){
        s00 = new Score(0, 0);
        s01 = new Score(1, 2);
        s02 = new Score(1, 1);
        s03 = new Score(0, 0);
        s10 = new Score(1, 3);
        s11 = new Score(0, 0);
        s12 = new Score(0, 0);
        s13 = new Score(1, 2);
        s20 = new Score(1, 3);
        s21 = new Score(0, 0);
        s22 = new Score(0, 0);
        s23 = new Score(1, 2);
        s30 = new Score(1, 3);
        s31 = new Score(1, 2);
        s32 = new Score(0, 0);
        s33 = new Score(0, 0);

        Score[] s0 = {s00, s01, s02, s03};
        Score[] s1 = {s10, s11, s12, s13};
        Score[] s2 = {s20, s21, s22, s23};
        Score[] s3 = {s30, s31, s32, s33};

        dataList.add(s0);
        dataList.add(s1);
        dataList.add(s2);
        dataList.add(s3);
    }
}

Experimental results

In the face of this iterative result, the author still has some doubts. The final iterative result in the book is said to be stable and the final result is [3/9, 2/9, 2/9, 2/9], but this experiment In the end, I did not get a satisfactory result, but the program was terminated directly because of data overflow, but it can be seen that the result of the last output is very close to the correct answer, including the results of each iteration. Find out what's the problem. I don’t know why the text on this picture is so small. Anyway, I can’t read it clearly, so use a table to fill it out:

5. Experimental summary

The author does not guarantee that the results of this experiment are correct. The author just provides an idea to implement the PageRank algorithm using the Java language. Because the experiment did not give an answer, the author has input the experimental data with the answer in the book into the program, and the result output by the program is consistent with the answer, so the problem should not be big. If there is something that is not well written, please give me some pointers!
The author's homepage also has a summary of other data mining algorithms, welcome to patronize!

Tags: Java data structure Algorithm Data Mining programming language

Posted by helloise on Mon, 19 Dec 2022 06:17:15 +1030