Wednesday, December 3, 2008

Maximum File Size for Crawling

By default, Search Services can crawl and filter files of up to 16 megabytes (MB). For a larger file, only the first 16 MB is crawled. When this limit is reached, SharePoint Portal Server writes a warning to the gatherer log: “The file reached the maximum download limit. Check that the full text of the document can be meaningfully crawled.”

To increase the 16 MB limit, you must add a new registry entry named MaxDownloadSize. To do this, follow these steps:

1. Start Registry Editor (Regedit.exe).
2. Locate the following key in the registry:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager
3. On the Edit menu, point to New, and then click DWORD Value. Name the new value MaxDownloadSize.
4. Double-click MaxDownloadSize, set the base to Decimal, and type the maximum size (in MB) for files that the gatherer downloads.
5. Restart the server.
6. Start a full crawl.
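
If you prefer not to edit the registry by hand, the same value can be put in a .reg file and imported. The following is only a minimal Python sketch of that idea: the function name build_reg_file and the 64 MB example value are my own assumptions, and the only range check is that the number fits in a 32-bit DWORD. It just builds the file text; it does not touch the registry itself.

```python
# Sketch: build the text of a .reg file that sets MaxDownloadSize.
# build_reg_file and the 64 MB example are illustrative assumptions.

KEY_PATH = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0"
    r"\Search\Global\Gathering Manager"
)


def build_reg_file(max_download_mb: int) -> str:
    """Return .reg file text that sets MaxDownloadSize to the given size in MB."""
    # A DWORD is an unsigned 32-bit value, so reject anything outside that range.
    if not 0 < max_download_mb <= 0xFFFFFFFF:
        raise ValueError("size must be a positive number that fits in a DWORD")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY_PATH}]",
        # .reg files store DWORD values as eight hexadecimal digits.
        f'"MaxDownloadSize"=dword:{max_download_mb:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_reg_file(64))
```

Note that the .reg format stores the number in hexadecimal, while step 4 above enters it as decimal in Registry Editor. You still need to restart the server and start a full crawl afterwards.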

NOTE: Increasing the maximum file size may cause timeout exceptions, because the crawler can time out if a large file takes too long to crawl or index. To increase the timeout values, follow these steps:

1. In Central Administration, on the Application Management tab, in the Search section, click Manage search service.
2. On the Manage Search Service page, in the Farm-Level Search Settings section, click Farm-level search settings.
3. In the Timeout Settings section, change the Connection time and Request acknowledgement time values.

Wednesday, November 19, 2008

The publishing portal template explained

In the real-world example mentioned above, we were using the Publishing Portal template, and our customer had some specific wishes that we could not meet out of the box.
So, how does this template work? Basically, the site definition it uses is BLANKINTERNET, which you can find in the famous 12/template/site templates directory.
When you open ONET.XML you will see several configuration tags. Remember how your site already contains a root site and two subsites (Press Releases and the hidden Search Center)? Well, the root site uses BLANKINTERNET#0 and the Press Releases subsite uses BLANKINTERNET#1. There is a third configuration, BLANKINTERNET#2, which is used for the new subsites you create yourself.
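
To see which configurations a site definition offers without reading the whole file, you can list the Configuration elements of its ONET.XML. This is a small Python sketch, not anything SharePoint ships with: the function name list_configurations is made up, and the inline SAMPLE_ONET is a hypothetical stand-in with placeholder Name values, not the real file contents.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an ONET.XML file. The Name values are
# placeholders for illustration, not the names used in the real file.
SAMPLE_ONET = """<Project>
  <Configurations>
    <Configuration ID="0" Name="PlaceholderRoot" />
    <Configuration ID="1" Name="PlaceholderPressReleases" />
    <Configuration ID="2" Name="PlaceholderSubsite" />
  </Configurations>
</Project>"""


def list_configurations(onet_xml, template_name="BLANKINTERNET"):
    """Return (template#id, name) pairs for every Configuration element."""
    root = ET.fromstring(onet_xml)
    result = []
    for element in root.iter("Configuration"):
        config_id = element.get("ID")
        if config_id is not None:
            # Site templates are addressed as <template name>#<configuration ID>.
            result.append((f"{template_name}#{config_id}", element.get("Name", "")))
    return result


if __name__ == "__main__":
    for template, name in list_configurations(SAMPLE_ONET):
        print(template, name)
```

To try it on a real site definition, read the ONET.XML file into a string and pass that instead of SAMPLE_ONET.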
So what happens when you select 'Create site collection' and choose the 'Publishing portal' template? Well, that is defined in the webtemp*.xml files you can find at c:\Program Files\Common Files\Microsoft Shared\web server extensions\12\TEMPLATE\1033\XML. You may not be aware that there is another important file called internetblank.xml, which is located at c:\Program Files\Common Files\Microsoft Shared\web server extensions\12\TEMPLATE\XML. The contents of that file look like this:

As you can see, it actually defines your initial site structure, and it uses the configurations from ONET.XML.
In the Webtempsps.xml file you will see a list of all the templates that show up in the template picker when you create your site collection. The Publishing Portal definition looks a little different from the others because it uses a provisioning technique that creates not just one site but the whole site structure defined in the internetblank.xml file: