Review Webserver Metafiles for Information Leakage (WSTG-INFO-03)
This section describes how to test various metadata files for information leakage of the web application's path(s) or functionality. Furthermore, the list of directories that are to be avoided by Spiders, Robots, or Crawlers can also be created as a dependency for mapping execution paths through the application. Other information may also be collected to identify attack surface, technology details, or for use in a social engineering engagement.
The objectives of this testing are to identify hidden or obfuscated paths and functionality through the analysis of metadata files, and to extract and map other information that could lead to a better understanding of the systems at hand.
Any of the actions performed below with wget could also be done with curl. Many Dynamic Application Security Testing (DAST) tools, such as ZAP and Burp Suite, include checks or parsing for these resources as part of their spider/crawler functionality. These resources can also be identified using various search engine dorks or by leveraging advanced search features such as inurl:.
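For instance, a web search along the following lines (example.com is a placeholder target, not part of the original text) may surface indexed metafiles:

site:example.com inurl:robots.txt
site:example.com inurl:security.txt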
Web Spiders, Robots, or Crawlers retrieve a web page and then recursively traverse hyperlinks to retrieve further web content. Their accepted behavior is specified by the Robots Exclusion Protocol of the robots.txt file in the web root directory.
As an example, consider the robots.txt file of www.google.com, sampled on 2020 May 5.
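The beginning of such a file typically looks like the following illustrative excerpt (the live file changes over time, so treat these paths as representative rather than an exact quotation):

User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?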
The User-Agent directive refers to the specific web spider/robot/crawler. For example, User-Agent: Googlebot refers to the spider from Google, while User-Agent: bingbot refers to a crawler from Microsoft. User-Agent: * in the excerpt above applies to all web spiders/robots/crawlers.

The Disallow directive specifies which resources are prohibited to spiders/robots/crawlers. In the excerpt above, crawling of /search (apart from the explicitly allowed /search/about and /search/static), /sdch, /groups, and /index.html? is prohibited.

Web spiders/robots/crawlers can intentionally ignore the Disallow directives specified in a robots.txt file. Hence, robots.txt should not be considered as a mechanism to enforce restrictions on how web content is accessed, stored, or republished by third parties.
The robots.txt file is retrieved from the web root directory of the web server. For example, to retrieve the robots.txt from www.google.com using wget or curl:
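Either of the following commands retrieves the file (the two tools are interchangeable here, as noted above):

wget https://www.google.com/robots.txt
curl -O https://www.google.com/robots.txt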
Web site owners can use the Google "Analyze robots.txt" function to analyze the website as part of its Google Webmaster Tools. This tool can assist with testing, and the procedure is as follows:
1. Sign into Google Webmaster Tools with a Google account.
2. On the dashboard, enter the URL for the site to be analyzed.
3. Choose between the available methods and follow the on-screen instructions.
<META> tags are located within the HEAD section of each HTML document and should be consistent across a web site in the event that the robot/spider/crawler start point does not begin from a document link other than webroot, i.e. a deep link. A robots directive can also be specified through use of a specific META tag.

If there is no <META NAME="ROBOTS" ... > entry, then the "Robots Exclusion Protocol" defaults to INDEX,FOLLOW. Therefore, the other two valid entries defined by the "Robots Exclusion Protocol" are prefixed with NO..., i.e. NOINDEX and NOFOLLOW.
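For example, a page that should be excluded from indexing and link-following would carry a tag such as the following (a hypothetical page, not quoted from a specific site):

<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">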
Based on the Disallow directive(s) listed within the robots.txt file in webroot, a regular expression search for <META NAME="ROBOTS" within each web page is undertaken and the result compared to the robots.txt file in webroot.
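A minimal sketch of such a check, assuming example.com as a placeholder target, could combine curl and grep:

# fetch a page and extract any robots META tags for comparison with robots.txt
curl -s https://example.com/ | grep -io '<meta name="robots"[^>]*>'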
Organizations often embed informational META tags in web content to support various technologies such as screen readers, social networking previews, search engine indexing, etc. Such meta-information can be of value to testers in identifying technologies used, and additional paths/functionality to explore and test. For example, informational META tags of this kind were observable on www.whitehouse.gov via View Page Source on 2020 May 05.
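Tags of the following general form are commonly encountered (the values here are placeholders, not the verbatim content of that page):

<meta property="og:site_name" content="Example Site"/>
<meta property="og:type" content="website"/>
<meta name="twitter:card" content="summary_large_image"/>
<meta name="generator" content="WordPress 5.4"/>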
A sitemap is a file where a developer or organization can provide information about the pages, videos, and other files offered by the site or application, and the relationship between them. Search engines can use this file to more intelligently explore your site. Testers can use sitemap.xml files to learn more about the site or application to explore it more completely.
As an example, Google's primary sitemap (https://www.google.com/sitemap.xml) was retrieved on 2020 May 05; it is a sitemap index pointing to further product sitemaps. Exploring from there, a tester may wish to retrieve the Gmail sitemap at https://www.google.com/gmail/sitemap.xml.
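Structurally, such a sitemap index looks like the following trimmed, illustrative snippet (only the Gmail entry is shown; real indexes list many more):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.google.com/gmail/sitemap.xml</loc>
  </sitemap>
  <!-- further <sitemap> entries omitted -->
</sitemapindex>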
security.txt was ratified by the IETF as RFC 9116 and allows websites to define security policies and contact details. There are multiple reasons this might be of interest in testing scenarios, including but not limited to:
Identifying further paths or resources to include in discovery/analysis.
Open Source intelligence gathering.
Finding information on Bug Bounties, etc.
Social Engineering.
The file may be present either in the root of the webserver or in the .well-known/ directory, for example:
https://example.com/security.txt
https://example.com/.well-known/security.txt
A real-world example of this file was retrieved from LinkedIn on 2020 May 05.
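Its general shape, using the fields defined by RFC 9116 with placeholder values rather than LinkedIn's actual entries, is:

Contact: mailto:security@example.com
Expires: 2026-12-31T23:59:59.000Z
Encryption: https://example.com/pgp-key.txt
Canonical: https://example.com/.well-known/security.txt
Policy: https://example.com/security-policy
Preferred-Languages: en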
humans.txt is an initiative for knowing the people behind a website. It takes the form of a text file that contains information about the different people who have contributed to building the website. This file often (though not always) contains information for career or job sites/paths.
For instance, Google publishes one at https://www.google.com/humans.txt (sampled 2020 May 05).
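The file is free-form text; a typical layout following the humanstxt.org conventions looks roughly like this (placeholder values, not Google's actual content):

/* TEAM */
Engineer: Jane Doe
Contact: jane [at] example.com
Location: Zurich, Switzerland

/* SITE */
Last update: 2020/05/05
Standards: HTML5, CSS3
Components: Node.js, React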
There are other RFCs and Internet drafts that suggest standardized uses of files within the .well-known/ directory; lists of these can be found, for example, in the IANA "Well-Known URIs" registry. It would be fairly simple for a tester to review the RFCs/drafts and create a list to be supplied to a crawler or fuzzer, in order to verify the existence or content of such files.
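A minimal probing sketch, assuming example.com as a placeholder target and a hand-picked subset of registered well-known URIs, might look like:

# report the HTTP status code for a few well-known paths
for f in security.txt change-password openid-configuration assetlinks.json mta-sts.txt; do
  curl -s -o /dev/null -w "%{http_code} /.well-known/$f\n" "https://example.com/.well-known/$f"
done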
Tools that can assist with this testing include:
Browser (View Source or Dev Tools functionality)
curl
wget
Burp Suite
ZAP