|
Extracting the structure of a website's information underlies many techniques for website classification. This paper surveys several algorithms for extracting website information structure and analyzes their advantages and disadvantages. Building on this analysis, it proposes a structure extraction method based on the degree of link association. First, the content of every page of the target website is extracted. Second, the extracted pages are used to compute the dissimilarity between pages and the dissimilarity between the links of each pair of pages. Third, the route from the home page to a target page is found with Dijkstra's algorithm. Finally, the structure of the whole website is generated from these routes. |
|
Keywords: Pattern recognition; Structure of the website information; Content extraction; Link association |
|
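The route-finding step summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the dissimilarity formula, so the Jaccard distance between two pages' outgoing-link sets is assumed here as the edge weight, and the names `link_dissimilarity`, `shortest_routes`, `route`, and the toy `site` graph are all hypothetical.

```python
import heapq


def link_dissimilarity(links_a, links_b):
    """Jaccard distance between two outgoing-link sets (an assumed measure)."""
    a, b = set(links_a), set(links_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def shortest_routes(graph, home):
    """Run Dijkstra's algorithm from the home page.

    graph maps each page to the list of pages it links to.
    Returns {page: parent}, the predecessor of each reachable page
    on its cheapest route from the home page.
    """
    dist = {home: 0.0}
    parent = {home: None}
    heap = [(0.0, home)]
    while heap:
        d, page = heapq.heappop(heap)
        if d > dist.get(page, float("inf")):
            continue  # stale heap entry, a cheaper route was already found
        for target in graph.get(page, ()):
            # Assumption: the edge weight is the link dissimilarity of the two pages.
            weight = link_dissimilarity(graph.get(page, ()), graph.get(target, ()))
            candidate = d + weight
            if candidate < dist.get(target, float("inf")):
                dist[target] = candidate
                parent[target] = page
                heapq.heappush(heap, (candidate, target))
    return parent


def route(parent, page):
    """Walk the predecessor map back to the home page and return the route."""
    path = []
    while page is not None:
        path.append(page)
        page = parent[page]
    return path[::-1]


# Toy site used only for illustration.
site = {
    "/": ["/news", "/products"],
    "/news": ["/", "/news/a"],
    "/products": ["/", "/products/x"],
    "/news/a": ["/news"],
    "/products/x": ["/products"],
}

parents = shortest_routes(site, "/")
print(route(parents, "/news/a"))  # ['/', '/news', '/news/a']
```

The predecessor map returned by `shortest_routes` is already a tree rooted at the home page, so in this sketch it doubles as the extracted site structure. A different dissimilarity measure can be substituted without changing the search.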