{"id":251,"date":"2021-02-07T09:29:17","date_gmt":"2021-02-07T17:29:17","guid":{"rendered":"http:\/\/junsun.net\/wordpress\/?p=251"},"modified":"2021-02-07T09:36:45","modified_gmt":"2021-02-07T17:36:45","slug":"clone-a-website-behind-login-with-winhttrack","status":"publish","type":"post","link":"http:\/\/junsun.net\/wordpress\/2021\/02\/clone-a-website-behind-login-with-winhttrack\/","title":{"rendered":"Clone a Website Behind Login with WinHTTrack"},"content":{"rendered":"\n<p>I have an very old Fedora server that stopped active running almost 10 years ago.  I need to backup its data for archive and physically dispose it.  <\/p>\n\n\n\n<p>It used to run wiki site based on <a href=\"http:\/\/moinmo.in\/\">moinmoin<\/a>.  If I just back its static files plus database files, chances are I will never be able to re-active those wiki pages and see them again.  So I thought a better idea is to clone the web pages and turn them into static local web site, which I can still access easily.<\/p>\n\n\n\n<p>It turns the journey is more complicated than expected.  It took me more than a couple of hours to finally nail it down.  So I figure it is worth a post here.<\/p>\n\n\n\n<h2>The Software<\/h2>\n\n\n\n<p>A quick search shows HTTrack seems to be best software for this job.  It has a GUI version for Linux, called WebHTTrack.  It also has a windows version called WinHTTrack.  During my fiddling around, I ended up using WinHTTrack.  In retrospect, the method used here should also work for WebHTTrack.<\/p>\n\n\n\n<h2>The Challenge<\/h2>\n\n\n\n<p>WinHTTrack is generally user friendly.  The biggest problem is moin wiki requires a user login to view the content.<\/p>\n\n\n\n<p><a href=\"https:\/\/askubuntu.com\/questions\/409736\/httrack-how-to-copy-a-website-with-email-based-username-login\">The cookie capture method<\/a> (&#8220;Add URL&#8221; followed with &#8220;Capture URL&#8221;) does not work for moin wiki.   Only first page works.  
Later pages still return &#8220;You are not allowed to view this page&#8221;.<\/p>\n\n\n\n<h2>The Solution<\/h2>\n\n\n\n<h3>Step 1 &#8211; prepare the cookies.txt file<\/h3>\n\n\n\n<ul><li>Install and launch the Firefox browser<\/li><li>Install the <a href=\"https:\/\/addons.mozilla.org\/en-US\/firefox\/addon\/cookies-txt\/\">cookies.txt extension<\/a><\/li><li>Log into the website and start browsing<\/li><li>Click on the &#8220;cookies.txt&#8221; extension and export cookies.txt for the current site.<\/li><li>Also note the Firefox user agent<ul><li>Click on the top-right &#8220;settings&#8221; icon<\/li><li>Then click on &#8220;Help&#8221;\/&#8220;Troubleshooting Information&#8221;<\/li><li>Note the &#8220;User Agent&#8221; string.  We will use it later.<\/li><\/ul><\/li><\/ul>\n\n\n\n<h3>Step 2 &#8211; run WinHTTrack once<\/h3>\n\n\n\n<ul><li>Start WinHTTrack and set up the project normally, without worrying about login<ul><li>Specifically, you don&#8217;t need to use the &#8220;Add URL&#8221; button to do anything special.  Just type the URL in the text input area.<\/li><\/ul><\/li><li>Click on &#8220;Finish&#8221; to start downloading<ul><li>It will warn that the site seems empty.  
That is expected.<\/li><\/ul><\/li><\/ul>\n\n\n\n<h3>Step 3 &#8211; copy over the cookies.txt<\/h3>\n\n\n\n<ul><li>Copy the exported cookies.txt file to the top-level directory of the local clone.<ul><li>For example, if the HTTrack working directory is &#8220;%USER%\\Documents\\httrack-websites&#8221; and your project name is &#8220;erick-wiki&#8221;, then the destination directory is &#8220;%USER%\\Documents\\httrack-websites\\erick-wiki&#8221;<\/li><\/ul><\/li><\/ul>\n\n\n\n<h3>Step 4 &#8211; run it again with the proper options<\/h3>\n\n\n\n<p>A few options need to be set properly:<\/p>\n\n\n\n<ul><li>Click on &#8220;Set options &#8230;&#8221; and a new pop-up window shows up<\/li><li>On the &#8220;Spider&#8221; tab, set &#8220;Spider&#8221; to &#8220;no robots.txt rules&#8221;<\/li><li>On the &#8220;Browser ID&#8221; tab, set &#8220;Browser Identity&#8221; to the &#8220;User Agent&#8221; string of the Firefox browser noted in Step 1<\/li><li>On the &#8220;Scan Rules&#8221; tab, click on &#8220;Exclude links&#8221;; another pop-up window shows up<ul><li>Here you can exclude the links you do not want to clone<\/li><li>Most importantly, you need to exclude the &#8220;logout&#8221; URL.  Otherwise, cookies.txt will be updated\/deleted and you will not be able to continue cloning<ul><li>In my case, the URL ends with &#8220;?action=logout&#8221;.  So I set &#8220;criterion&#8221; to &#8220;Links containing:&#8221;, and set &#8220;String&#8221; to &#8220;action=logout&#8221;<\/li><\/ul><\/li><\/ul><\/li><li>(Optional) On the &#8220;Experts Only&#8221; tab, change &#8220;Travel mode&#8221; to &#8220;can go both up and down&#8221;<\/li><li>(Optional) On &#8220;Flow control&#8221;, set &#8220;number of connections&#8221; to a higher number for faster cloning.   
<em>Note: your website may have a surge-protection limit, as MoinMoin does, in which case you will need to turn off that feature on the web server in order to increase bandwidth.<\/em><\/li><li>Once everything is set, click on &#8220;Next&#8221; and &#8220;Finish&#8221; to start cloning.<\/li><\/ul>\n\n\n\n<p>That is it!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I have a very old Fedora server that stopped actively running almost 10 years ago. I need to back up its data for archival and physically dispose of it. It used to run a wiki site based on MoinMoin. If I just back up its static files plus database files, chances are I will never be able to reactivate &hellip; <\/p>\n<p class=\"link-more\"><a href=\"http:\/\/junsun.net\/wordpress\/2021\/02\/clone-a-website-behind-login-with-winhttrack\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Clone a Website Behind Login with WinHTTrack&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":""},"categories":[3],"tags":[65,63,64],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"http:\/\/junsun.net\/wordpress\/wp-json\/wp\/v2\/posts\/251"}],"collection":[{"href":"http:\/\/junsun.net\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/junsun.net\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/junsun.net\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/junsun.net\/wordpress\/wp-json\/wp\/v2\/comments?post=251"}],"version-history":[{"count":2,"href":"http:\/\/junsun.net\/wordpress\/wp-json\/wp\/v2\/posts\/251\/revisions"}],"predecessor-version":[{"id":253,"href":"http:\/\/junsun.net\/wordpress\/wp-json\/wp\/v2\/posts\/251\/revisions\/253"}],"wp:attachment":[{"href":"http:\/\/junsun.net\/wordpress\/wp-json\/wp\/v2\/media?parent=251"}
],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/junsun.net\/wordpress\/wp-json\/wp\/v2\/categories?post=251"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/junsun.net\/wordpress\/wp-json\/wp\/v2\/tags?post=251"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}