A typical problem when automatically crawling a website is detecting session identifiers. When a client fails to provide the correct session identifiers, a web application will terminate the session, and the crawl operation will result in poor application coverage.
Session identifiers can be transmitted in request parameters, in cookies, or in other dynamic elements, for example, path constructions resulting from URL rewriting operations.
Current solutions for problems associated with session identification are typically of limited reliability because they depend on heuristics, such as known session identifier name patterns and the entropy of the value. Issues associated with the current approaches include a reliance on expert knowledge to create the identifiers and a dependency on common practices of servers to populate identifier values.
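The heuristic approach described above can be sketched as follows. This is a minimal illustration, not a description of any particular crawler: the name pattern, the length and entropy thresholds, and the function names are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter
from typing import List
from urllib.parse import parse_qsl, urlsplit

# Hypothetical name patterns; a real list would need expert maintenance.
KNOWN_NAME_PATTERN = re.compile(r"(sess|sid$|token)", re.IGNORECASE)


def shannon_entropy(value: str) -> float:
    """Return the Shannon entropy of the string in bits per character."""
    if not value:
        return 0.0
    counts = Counter(value)
    total = len(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_session_id(name: str, value: str,
                          min_length: int = 16,
                          min_entropy: float = 3.0) -> bool:
    """Heuristic guess: a known name pattern, or a long high-entropy value.

    The thresholds are illustrative assumptions, not tuned values.
    """
    if KNOWN_NAME_PATTERN.search(name):
        return True
    return len(value) >= min_length and shannon_entropy(value) >= min_entropy


def candidate_session_params(url: str) -> List[str]:
    """Return the names of query parameters that the heuristic flags."""
    query = urlsplit(url).query
    return [name for name, value in parse_qsl(query)
            if looks_like_session_id(name, value)]
```

Because this sketch inspects only query parameters, an identifier carried in the request path is never examined, which illustrates how a heuristic tied to common server practices can miss less conventional placements.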
Although current detection methods can prove effective, they typically do not cover all cases. For example, when a URL rewriting operation is performed, session identifiers with a relevant value are passed in the request path, as in the sample path GET /S(120fd4ovfqyogf34f)/home.asp HTTP/1.1, which is a non-standard or private method of transferring session information. In this example, a web crawler must be pre-configured to detect and identify the session identifier value. Because such implementations are left to the creativity of web application developers, maintaining a reliable set of heuristics to identify URL rewriting session identifiers is typically impossible for anyone other than the developers of the specific application.
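To make the pre-configuration requirement concrete, the sketch below hard-codes a pattern for the sample path shown above. The regular expression and the function name are assumptions written for that one sample; an application that embeds its identifier differently would need a different rule.

```python
import re
from typing import Optional, Tuple

# Assumed pattern, tailored to the sample path: a segment of the form "S(<value>)".
SESSION_SEGMENT = re.compile(r"/S\(([^()/]+)\)(?=/|$)")


def split_session_from_path(path: str) -> Tuple[str, Optional[str]]:
    """Return (path without the session segment, session value or None)."""
    match = SESSION_SEGMENT.search(path)
    if match is None:
        return path, None
    stripped = path[:match.start()] + path[match.end():]
    return stripped or "/", match.group(1)
```

Applied to /S(120fd4ovfqyogf34f)/home.asp, the function yields the path /home.asp and the value 120fd4ovfqyogf34f. A path that uses any other private scheme passes through unchanged with no identifier reported, which is the coverage gap described above.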
In addition, because the session identifier values are part of a path, the values can easily be confused with folder names, influencing navigation of a site. A web crawler might become stuck in an endless loop because there could be an infinite set of values for the session identifier path element. Constructs such as the one in the example can occur anywhere in a request and are very common in implementations using Web 2.0 technologies, for example Ajax callbacks.
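The endless-loop risk can be illustrated with a small simulation. Everything here is hypothetical: the fake server, which embeds a fresh session value in every link it returns, the two page names, and the fetch cap that stops the naive crawl. The normalizing key works only because the path pattern is known in advance.

```python
import itertools
import re
from collections import deque

# Assumed pattern matching the "S(<value>)" segment of the sample path.
SESSION_SEGMENT = re.compile(r"/S\([^()/]+\)(?=/|$)")


def normalize(path):
    """Drop the session segment so equivalent pages share one key."""
    return SESSION_SEGMENT.sub("", path) or "/"


def fake_server_links(counter):
    """Hypothetical server: each response links out with a fresh session value."""
    session = f"{next(counter):08x}"
    return [f"/S({session})/home.asp", f"/S({session})/about.asp"]


def crawl(key, max_fetches=50):
    """Breadth-first crawl; `key` maps a path to its visited-set key.

    Returns the number of pages fetched before the queue empties
    or the fetch cap is reached.
    """
    counter = itertools.count(1)
    visited = set()
    queue = deque(["/home.asp"])
    fetches = 0
    while queue and fetches < max_fetches:
        path = queue.popleft()
        if key(path) in visited:
            continue
        visited.add(key(path))
        fetches += 1
        queue.extend(fake_server_links(counter))
    return fetches
```

With the raw path as the key, crawl(lambda p: p) never sees a repeated path and stops only at the cap of 50 fetches. With crawl(normalize), the crawl ends after the two distinct pages have been fetched.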