Tuesday, January 1, 2013

The Skeleton Supporting Search Engine Ranking Systems



A lot of the research I’m interested in relates to networks – measuring the properties of networks and figuring out what those properties mean.  While doing some background reading, I stumbled upon some discussion of the algorithm that search engines use to rank search results.  The automatic ranking of the results that come up when you search for something online is a great example of how understanding networks (in this case, the World Wide Web) can be used to turn a very complicated problem into something simple. 

Ranking search results relies on the assumption that there is some underlying pattern to how information is organized on the WWW- there are a few core websites containing the bulk of the sought-after information surrounded by a group of peripheral websites that reference the core.  Recognizing that the WWW is a network representation of how information is organized and using the properties of the network to detect where that information is centered are the key components to figuring out what websites belong at the top of the search page.

Suppose you look something up on Google (looking for YouTube videos of your favorite band, looking for edifying science writing, tips on octopus pet care, etc):  the search service returns a whole spate of results.  Usually, the pages that Google recommends first end up being the most useful.  How on earth does the search engine get it right?

First I’ll tell you exactly how Google does not work.  When you type in something into the search bar and hit enter, a message is not sent to a guy who works for Google about your query.  That guy does not then look up all of the websites matching your search, does not visit each website to figure out which ones are most relevant to you, and does not rank the pages accordingly before sending a ranked list back to you.  That would be a very silly way to make a search engine work!  It relies on an individual human ranking the search results by hand with each search that’s made.  Maybe we can get around having to hire thousands of people by finding a clever way to automate this process.

So here’s how a search engine does work.  Search engines use robots that crawl around the World Wide Web (sometimes these robots are referred to as “spiders”) finding websites, cataloguing key words that appear on those webpages, and keeping track of all the other sites that link into or away from them.  The search engine then stores all of these websites and lists of their keywords and neighbors in a big database.

Knowing which websites contain which keywords allows a search engine to return a list of websites matching a particular search.  But simply knowing which websites contain which keywords is not enough to know how to order the websites according to their relevance or importance.  Suppose I type “octopus pet care” into Google.  The search yields 413,000 results- far too many for me to comb through at random looking for the web pages that best describe what I’m interested in.

Knowing the ways that different websites connect to one another through hyperlinks is the key to how search engine rankings work.  Thinking of a collection of websites as an ordinary list doesn’t say anything about how those websites relate to one another.  It is more useful to think of the collection of websites as a network, where each website is a node and each hyperlink between two pages is a directed edge in the network.  In a way, these networks are maps that can show us how to get from one website to another by clicking through links.

Here is an example of what a network visualization of a website map of a large portion of the WWW looks like.  (Original full-size image here.)  


Here is a site map for a group of websites that connect to the main page of English Wikipedia.  (Original image from here.)  This smaller site map is closer to the type of site map used when making a search using a search engine.


So, how does knowing the underlying network of the search results help one to find the best website on octopus care (or any other topic)?  The search engine assumes that behind the seemingly random, hodgepodge collection of files on the WWW, there is some organization in the way they connect to one another.  Specifically, the search engine assumes that finding the websites most central to the network of search results is the same as finding the search results with the best information.  Think of a well-known, trusted source of information, like the New York Times.  The NY Times website will have many other websites referencing it by linking to it.  In addition, the NY Times website, being a trusted news source, is likely to refer to the best references for other sources that it wants to refer to, such as Reuters.  High-quality references will also probably have many incoming links from websites that cite them.  So not only does a website like the NY Times sit at the center of many other websites that link to it, but it also frequently connects to other websites that themselves are at the center of many other websites.  It is these most central websites that are probably the best ones to look at when searching for information.

When I search for “octopus pet care” using Google I am necessarily assuming that the search results are organized according to this core-periphery structure, with a group of important core websites central to the network surrounded by many less important peripheral websites that link to the core nodes.  The core websites may also connect to one another.  There may also be websites disconnected from the rest, but these will probably be less important to the search simply because of the disconnection.  Armed with the knowledge of the connections between the different relevant websites and the core-periphery network structure assumption, we may now actually find which of the websites are most central to the network (in the core), and therefore determine which websites to rank highly.

Let’s begin by assigning a quantitative “centrality” score to each of the nodes (websites) in the network, initially guessing that all of the search results are equally important.  (This, of course, is probably not true.  It’s just an initial guess.)  Each node then transfers all* of its centrality score to its neighbors, dividing it evenly between them.  (Starting with a centrality score of 1 with three neighbors, each of those neighbors receives 1/3.)  Each node also receives a some centrality from each neighbor that links in to it.  Following this first step, we find that nodes with many incoming edges will have higher centrality than nodes with few incoming edges.  We can repeat this process of dividing and transferring centrality again.  Nodes with many incoming links will have more centrality to share with their neighbors, and nodes with many incoming links will themselves also receive more centrality.

After repeating this process many times, we begin to see a difference between which nodes have the highest centrality scores:  nodes with high centrality are the ones that have many incoming links, or have links to other central nodes, or both.  This algorithm therefore differentiates between the periphery and the core of the network.  Core nodes receive lots of centrality because they link to one another and because they have lots of incoming links from the periphery.  Peripheral nodes have fewer incoming links and so receive less centrality than the nodes in the core.  Knowing the centrality scores of search results on the WWW makes it pretty straightforward for us to quantitatively rank which of those websites belongs at the top of the list.

Of course, there are more complex ways that one can add to and improve this procedure.  Google’s algorithm PageRank (named for founder Larry Page, not because it is used to rank web pages) and the HITS algorithm developed at Cornell are two examples of more advanced ways of ranking search engine results.  We can go even further: a search engine can keep track of the links that users follow whenever a particular search is made.  (This is almost the same as the company hiring someone to order sought-after web pages automatically whenever a search is made, except all the company lets the user do it for free.)  Over time, search engines can improve their methods for helping us find what we need by learning directly from the way users themselves prioritize which search results they pursue.  Still, these different search engine ranking systems may operate using slightly different methods, but all of them depend on understanding the list of search results within the context of a network.


* It's not always all - there are other variations where nodes only transfer a fraction of their centrality score at each step.


Sources (and further reading):
I wanted to include no mathematics in this post simply because I cannot explain the mathematics behind these algorithms and their convergence properties better than my sources can.  For those of you who want to see the mathematical side of the argument for yourselves (which involves treating the network adjacency matrix as a Markov process and finding its nontrivial steady state eigenvector), do consult the following two textbooks:
• Easley, David, and Jon Kleinberg.  Networks, Crowds, and Markets: Reasoning about a Highly Connected World.  Cambridge University Press, 2010 (Chapter 14 in particular)
• Newman, Mark.  Networks: an Introduction. Oxford University Press, 2010 (Chapter 7 in particular)
A popular book on the early development of network science that contains a lot of information on the structure of the WWW:
• Barabasi, Albert-Laszlo. Linked: How Everything is Connected to Everything Else and What It Means.  Plume, 2003.
A book on the history of modern computing that contains an interesting passage on how search engines learn adaptively from their users (that deserves a shout-out in this blog post).
• Dyson, George.  Turing's Cathedral.  Pantheon, 2012.

24 comments:

  1. Your site should be seo friendly to increase the Search Engine Ranking and it will help your site.

    ReplyDelete
  2. This is the very nice site and this post is really informatics, there are many seo services and increase the search engine ranking and it will help your site.

    best seo company

    ReplyDelete
  3. Thanks for sharing this post. is post really enlighten my mind… gradually, the used of the right SEO is clearer to me now! search engine ranking

    ReplyDelete
  4. Great article! Thanks for sharing and more power to you!

    ReplyDelete
  5. ePageWiz will be a program that also converts html pages to be able to a remain-alone EXE post. This ebook software offers full support of multimedia contents (HTML Pages, Java Scripts, Flash Files, VB Scripts, Graphic Files, Video Files Fashion Sheets, Sound Files and etc). This program also gets a chance to help password protect your post from copyright violators and the capability in order to add 'trial limitation' which require users to register in order to have full reach in order to the manuscript.

    ReplyDelete
  6. you have an urgent Skeleton theme support request, at this time the paid option is the best means.seo sheffield

    ReplyDelete
  7. Recognizing that the WWW is a network representation of how information is organized and using the properties of the network to detect where that information is centered are the key components to figuring out what websites belong at the top of the search page.
    SEO services

    ReplyDelete
  8. What a great blog, i am searching in Google from couple of days.Two Simple Ways to Get Traffic to Your Blog” but did not find any great way,but my search came to an end after visiting your blog.
    Find a band

    ReplyDelete
  9. I first heard about Search Engine Ranking and Optimizing from a Scottsdale SEO Company. If you have the proper tools and people you can get your brand to be at the top of SERPs, hence raising brand awareness. Search engine is a wonderful place to be in, it's easier to ask GOOGLE with questions rather than asking Parents.

    ReplyDelete
  10. Best SEO tool is a one touch resource which provides you a free SEO report for your website. We offer free and instant online reports of webpage rankings in the top 5 search engines: Google (top 10), Bing (top 10) and Yahoo! (top 50).

    ReplyDelete
  11. You completed certain reliable points there. I did a search on the subject and found nearly all persons will agree with your blog.

    ReplyDelete
  12. Fabulous post i have got her that you have shared.

    ReplyDelete
  13. This post publish new important news of seo.This post article is important and helpful for SEO services .Thanks a lot for share your great post.

    ReplyDelete
  14. Awesome and interesting article. Great things you've always shared with us. Thanks. Just continue composing this kind of post.
    Choose Professional SEO services

    ReplyDelete
  15. Home inspections are usually conducted by a home inspector who has the training and certifications to perform such inspections. The inspector prepares and delivers to the client a written report of findings. hampton bay

    ReplyDelete
  16. A huge people come this temple because they believe at that place every wish must be completed. Popular Xbox 360 games

    ReplyDelete
  17. I have not heard about before. Keep sharing more blog posts on this Community and get extra benefits with this ideas and knowledge. Thanks for this one.
    saatchigallery compe
    techstarusa saatchigallery
    dzone

    ReplyDelete
  18. your post was very useful and informative. Thank you so much for such a great piece of information. Please check out some more information regarding what is serps.

    ReplyDelete
  19. TWOMC artistic Digital Agency. Contact our offices at state, Hong Kong, and Canada.
    Source: Digital Agency Indonesia

    ReplyDelete
  20. Best Fiverr Outsourcing Micro Jobs Pakistan Great Affordable Seo Services.
    http://seo-prince.com

    ReplyDelete
  21. social media marketing in orange county ,orange county digital marketing agency,social media marketing for dentists.
    orange county digital marketing agency

    ReplyDelete
  22. Best digital marketing agency, social media marketing, business consulting services, community engagement,
    social media marketing in orange county

    ReplyDelete
  23. online digital marketing agency provide you the best Digital marketing for all the beginerrs
    http://twomc.com/portfolio.html

    ReplyDelete