Archive for the ‘Web Programming’ Category

Creating a static copy of a dynamic website

August 31st, 2010 Behzad

From blog entry at: http://blog.jphoude.qc.ca/2007/10/16/creating-static-copy-of-a-dynamic-website/

At work we have several websites that we develop, but each year we make a new version and we want to keep an archive of the old version.

Since it takes a lot of memory to keep a Zope instance running for these old websites, which probably won’t need to be edited ever again, it makes sense to make a static copy of each website. It also eliminates the work needed to update the instance when security patches come out (and eliminates security risks, in the case of old versions that are no longer maintained).

There are some tools that can help in this case; I chose to use wget, which is available in most Linux distributions by default.

The command line, in short…

# wget -k -K -E -r -l 10 -p -N -F --restrict-file-names=windows -nH http://website.com/

…and the options explained

-k : convert links to relative ones, so they work in the local copy
-K : keep the original version of each file, without the conversions made by wget (saved with a .orig suffix)
-E : save HTML files with an .html extension (if they don’t already have an htm(l) extension)
-r : recursive… of course we want to make a recursive copy
-l 10 : the maximum level of recursion. If you have a really big website you may need a higher number, but 10 levels should be enough.
-p : download all necessary files for each page (CSS, JS, images)
-N : turn on time-stamping, so files that are not newer than the local copy are not downloaded again.
-F : when input is read from a file (with -i), force it to be treated as an HTML file. It has no effect on the command above, which does not read URLs from a file.
-nH : by default, wget puts files in a directory named after the site’s hostname. This disables the creation of those hostname directories and puts everything in the current directory.
--restrict-file-names=windows : may be useful if you want to copy the files to a Windows PC.
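
For readability, the same command can be wrapped in a small shell function using wget’s long option names. This is only a sketch: mirror_site is a made-up helper name, the URL in the usage line is the same placeholder as above, and -F is left out because it only matters when wget reads its URLs from a file with -i.

```shell
# mirror_site URL: make a static copy of URL in the current directory.
# (mirror_site is a hypothetical helper name, not part of wget.)
# Options, in order:
#   --convert-links         (-k)  rewrite links so they work in the local copy
#   --backup-converted      (-K)  keep a .orig copy of each converted file
#   -E                            save HTML pages with an .html extension
#   --recursive             (-r)  follow links
#   --level=10              (-l)  maximum recursion depth
#   --page-requisites       (-p)  also fetch CSS, JS and images
#   --timestamping          (-N)  skip files that are not newer on the server
#   --no-host-directories   (-nH) do not create a hostname directory
#   --restrict-file-names=windows  use file names that are safe on Windows
mirror_site() {
    wget \
        --convert-links \
        --backup-converted \
        -E \
        --recursive \
        --level=10 \
        --page-requisites \
        --timestamping \
        --no-host-directories \
        --restrict-file-names=windows \
        "$1"
}

# Usage (this performs network requests):
# mirror_site http://website.com/
```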

Possible problems

  • wget downloads the homepage and robots.txt, then stops!
    Your robots.txt file probably denies access to your site to search engines. Yes, in recursive mode, wget respects the robots.txt file, so you will need to remove it before making the copy (or tell wget to ignore it with -e robots=off). Don’t forget to put it back in the static site if that’s what you want.
  • Stylesheets : if you have @import stylesheet imports, wget won’t see them, and won’t download them :( You might want to change them to <link rel="stylesheet" … /> imports, which wget will see and download.
  • Stylesheet images : wget won’t download background images referenced in CSS files. For most websites, downloading those images manually should not take too long. Newer wget releases (1.12 and later) parse CSS and follow these references themselves, so these two problems mostly affect older versions.
  • Be sure that your CSS files end with “.css”! Apache won’t send the correct MIME type if your file extension is not .css, and Firefox will not use the stylesheet.
    (test.css?color=blue won’t work; change it to test.css?color=blue&ext=.css)
    The same problem may happen with other file types that need a proper MIME type (video files, for instance).
  • LinguaPlone specific problems
    • To avoid ending up with several duplicated files that differ only in the set_language parameter, you could set up one subdomain for each language, and force the set_language= parameter in the Apache redirect rule.
    • I also recommend changing the language link so it points to the main page instead of the current page.
    • You have several possibilities here, but if you just run wget without changing anything, you may end up with pages where the languages are a bit messed up.
  • <base> tag problem : If your pages contain a base tag (which is true for Plone sites), wget will empty its value but leave the base tag there (<base href="" />). That works in Firefox, but it will confuse IE, which won’t load any images, CSS or links. To fix it, you can remove the base tag completely with this command :
    # find . -name '*.html' -print0 | xargs -0 perl -i -p -e 's/<base href="" \/>//g'
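
If your wget does not pick up the stylesheet references, it can help to list every url(...) reference in the downloaded CSS files so you know what to fetch by hand. The function below is a rough sketch, not a full CSS parser: list_css_urls is a hypothetical name, it assumes a grep and sed that support -o and -E (GNU and BSD versions do), and it misses @import rules written without url().

```shell
# list_css_urls DIR: print each distinct url(...) target found in the
# .css files under DIR, one per line, with surrounding quotes removed.
# (list_css_urls is a hypothetical helper name.)
list_css_urls() {
    find "$1" -type f -name '*.css' -exec grep -ohE 'url\([^)]*\)' {} + |
        sed -E -e 's/^url\(//' -e 's/\)$//' \
               -e "s/^[\"' ]*//" -e "s/[\"' ]*\$//" |
        sort -u
}

# Usage: list_css_urls /path/to/static/copy
```

Each line of output is a path as written in the stylesheet, so relative paths still have to be resolved against the location of the CSS file that mentions them.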

Downsides

  • Most file names will change (bad for SEO)
  • May take some manual work to have a working static copy

After taking care of all the possible problems, you should have a working static site! Be sure to check with both IE and Firefox (at least), because some problems happen in only one browser.
Then, you can shut down your CMS and serve the static content using a standard web server.

Don’t forget to set up a nice 404 page pointing to your main page, since your URLs have probably changed and visitors coming from search engines or bookmarks will get a 404 error.
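
On Apache, one way to do this is an ErrorDocument rule in the site’s .htaccess file. A minimal sketch, assuming your server allows this directive in .htaccess and that you have created a page called 404.html in the site root (both the page name and the add_404_rule helper are made up for this example):

```shell
# add_404_rule DIR: append an ErrorDocument rule to DIR/.htaccess.
# /404.html is a hypothetical page name; the page must exist in the site root.
add_404_rule() {
    printf 'ErrorDocument 404 /404.html\n' >> "$1/.htaccess"
}

# Usage: add_404_rule /path/to/static/copy
```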


CakePHP 1.2 Release (with a New Site Design)

February 16th, 2008 Behzad

As Chris Hartjes points out, there’s a new release of the popular PHP framework CakePHP (as well as a new website design).

You can grab the latest download directly from the homepage or look into the manual to find out more about the framework and how it can be used.

Custom PHP.ini File With Your Linux Shared Hosting

February 13th, 2008 Behzad

If your host allows it, you can use a custom php.ini file within your Linux shared hosting environment. Using .htaccess, you can point PHP at a custom configuration file and then enable or disable PHP functions and settings as you wish, either site-wide or per directory. This can be helpful if your host has disabled certain modules in the server-wide php.ini file, or if there are modules which are enabled but which you’d rather have disabled, for example because they don’t work with a PHP application you have installed on your website. As long as your shared hosting environment permits it, deploying a custom php.ini file is relatively easy if you follow the steps below.

Deploying a Custom PHP.ini File

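First off, you need a .htaccess file; you might already have one within your hosting environment, in which case you can simply edit it. Note that this approach generally requires PHP to run as CGI or FastCGI; when PHP runs as an Apache module, php.ini is read when Apache starts, before .htaccess is consulted. In either case, add the following line to the .htaccess file so that the custom php.ini file can be found: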

SetEnv PHPRC /path/to/custom/php.ini

You will then need to create the custom php.ini file itself within your site; the file can actually have any name and any extension, if you want. Within this file you can specify which PHP modules or settings are enabled or disabled for your website. The custom php.ini file is picked up through your .htaccess file, and your custom settings are then applied to your virtual environment at run time. For our example we will disable PHP’s magic quotes. To do so, paste the following three lines into the custom php.ini file:

magic_quotes_gpc = Off
magic_quotes_runtime = Off
magic_quotes_sybase = Off

Once that has been done, and as long as the .htaccess file points to the correct custom php.ini file, Apache should pick the changes up on the next request, and the specified settings will take effect every time your site runs, provided the files stay in their correct locations. Another example is setting the time zone for your website through the custom php.ini file; to do this, paste the following line into the file:

date.timezone = "America/Indianapolis"

The example shown sets the time zone to Indiana, USA; if we want to set the time zone to London (GMT) instead, we need to paste in the following line:

date.timezone = "Europe/London"

Changing the time zone can be important, since your website might be hosted somewhere like America while your audience is in Australia, which is a vast time difference. Some applications are time- and date-sensitive, so you should set the time zone to suit your audience; it could confuse both the system and your visitors if a different time from what they’re used to is displayed, or if the wrong day altogether is shown.
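
Putting the steps together, the whole setup could be scripted roughly like this. It is a sketch under assumptions: write_custom_php_ini is a hypothetical helper, the directory you pass in should be the absolute path of your site on the host, and the settings are simply the examples from this post.

```shell
# write_custom_php_ini DIR: create DIR/php.ini with the example settings
# and add a SetEnv line to DIR/.htaccess that points at it.
# (write_custom_php_ini is a hypothetical helper name.)
write_custom_php_ini() {
    dir="$1"
    # Overwrites any existing php.ini in DIR.
    cat > "$dir/php.ini" <<'EOF'
; Custom settings taken from the examples in this post
magic_quotes_gpc = Off
magic_quotes_runtime = Off
magic_quotes_sybase = Off
date.timezone = "Europe/London"
EOF
    # Appends, so an existing .htaccess keeps its other rules.
    printf 'SetEnv PHPRC %s/php.ini\n' "$dir" >> "$dir/.htaccess"
}

# Usage: write_custom_php_ini /path/to/your/site
```

Running it twice adds the SetEnv line twice, so check the .htaccess file afterwards if you rerun it.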

Conclusion

With .htaccess you can use a custom php.ini at either site or directory level within your website, allowing you to run your website with the settings you want. This is useful because it lets you achieve things that are otherwise often only possible in a dedicated environment such as a VPS or a dedicated server. Also, some applications only work if certain PHP settings are enabled or disabled, so this lets you tailor your environment to the PHP applications you want to run. This sort of tweaking certainly helps you make the most of your shared or reseller web hosting.