cleanPages Extension - an arc90 Readability conversion
Wednesday, January 19, 2011 11:38:59 AM
For discussions about cleanPages v 1.5 please visit the new service page
cleanPages is NOT an adblocker or scriptblocker; it cleans pages for reading or printing after they have been loaded.
Version: 1.0.2
Download from the addons page: cleanPages
Supported Languages: English, French, German, Italian, Polish, Portuguese, Russian, Turkish and Swedish.
Latest test version:
1.5.54
Warning: May be unstable!
Date: 2012-07-10 17:23 GMT+2
Download from my private server: cleanPages.oex
A warning will be displayed, you'll have to trust me

Snapshot users:
Please add http://quhno.internetstrahlen.de to the trusted repositories.
Menu -> Settings -> Preferences -> Advanced -> Security -> Trusted Web sites (Trusted Repositories Tab)
Known issue:
In the latest Opera snapshots the settings are sometimes lost after an Opera restart. Not my fault; other extensions suffer from the same issue. It is an Opera bug (CORE-47777).
Changed in the Alpha (latest changes on top):
- kill setInterval() for merged pages too
- prevent set click event listeners on body or documentElement from bleeding into the cleaned page
- removed independent setting for line-height because of DSK-344053
- kill scripts that are started by setInterval() too
- cleaned up settings page
- Some minor fixes for hidden contents - now they should stay hidden
- Fix for headlines: Big justified text looked ugly, changed to pure right or left align. Known problem: Wrong align on RTL (Arabic or Hebrew) headlines and text align justify or center, please select text-align right when reading these.
- Fix for content images when screen is very narrow. restricted maximum width to text width.
- Experimental change of the content image detection heuristics
- Deleting non displayed content, i.e. content with display:none, visibility:hidden, opacity:0
- Added Italian translation
- "edit" links in most Wikis will be removed
- Added Turkish translation
- Switched off pagination for sciencesetavenir.fr
- Squashed bug where too long entries in the color input fields failed to update the preview
- Prepared internationalization of the preferences page
- Added Russian help
- Added help localization structure and German help page
- Extension's button can be hidden (see help page)
- the 3 tool buttons top left in the page are now hidden by default.
- Added Help page
- Changed the way the CSS is applied, should be more robust now.
- Extension resets itself to default values after a fresh install.
- added setting for uncolorized black background around the cleaned text.
- added some options to the preferences page: show images, show vimeo and youtube videos, merge paginated pages
- Workaround for Opera CORE-23171
- Options styled and some minor changes
- Mouse gesture support - (see help file)
- Faster reload
- Added Ctrl+Shift+R as keyboard shortcut to start the extension and to reload the original page
- New icon
- Extended font support. Checks which fonts from a list of 509 of the most commonly installed fonts are present on your computer.
- Added autoscrolling feature, watch the upper right corner.
- New settings page
- Bugfix: Additional footnote anchors in text if the button was clicked more than once. (see details)
- Improved the next page detection but there are still some quirks left (at least I hope it didn't break it too much)
- Bugfix: elements styled by <u><b> were removed including their contents.
- Improved duplicate pages detection on multi page articles - should work now correctly with my.opera blog articles with more than one comment page too.
- changed the width setting to fixed values plus a percentage maximum width, to avoid horizontal scroll bars when the width is set too wide or the window width is changed afterwards.
- experimental fix for H2s abused as intro
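The changelog entry on deleting non-displayed content can be pictured with a small check like the one below. This is only a sketch under assumptions, not the extension's actual code: isHiddenStyle is a hypothetical helper, and it assumes it is handed an object with the same property names that getComputedStyle() reports.

```javascript
// Hypothetical helper, not taken from cleanPages itself.
// `style` is assumed to look like the result of getComputedStyle(element).
function isHiddenStyle(style) {
  if (style.display === "none") return true;
  if (style.visibility === "hidden") return true;
  // Computed opacity is reported as a string such as "0" or "0.5".
  if (parseFloat(style.opacity) === 0) return true;
  return false;
}
```

In a page this could be called as isHiddenStyle(getComputedStyle(element)) before deciding whether to drop the element.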
Known issues of the Alpha
- Line-height changes are only applied after a window size change or after setting the font size or spacing in 11.50+. Not my fault, reported as Opera bug DSK-344053
- No internationalization for the help page
- Opera 12 sometimes doesn't want to clean the page again on pressing the extension's button after a previous cleaning attempt on the same page. They broke it. Use CTRL+SHIFT+R instead, that works reliably.
- Several more issues
Please post major errors you encounter in the basic functionality here in the blog comments. Thank You!
Usage
If the extension's button is active, you can click on it to change the layout of the active tab's content - or you can select some text (300+ characters) and click the button to make that text readable. If you selected too little text, cleanPages switches back to the default mode and tries to find the relevant content on its own.
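The choice between the two modes could look roughly like this. It is a sketch only: chooseCleaningMode and the mode names are made up for illustration, and whether the real extension trims whitespace before counting is an assumption.

```javascript
// Assumed threshold, taken from the "300+ characters" mentioned above.
const MIN_SELECTION_CHARS = 300;

// Hypothetical helper: decides which mode to use for a given selection.
// Returns "selection" when enough text is selected, otherwise "auto",
// meaning the extension looks for the relevant content on its own.
function chooseCleaningMode(selectedText) {
  const length = (selectedText || "").trim().length;
  return length >= MIN_SELECTION_CHARS ? "selection" : "auto";
}
```

In a page the argument would typically come from window.getSelection().toString().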
cleanPages shows 3 buttons on the cleaned webpage:
- Reload: It has basically the same behavior as the normal reload button in the browser. It is a true reload except when used on frame sites, then the history is used to go back to the same subframes as before. (Read about History Navigation Mode quirks below)
- Print: Opens the Print dialog to print the cleaned page. Text will be black, backgrounds will be white, the buttons will be not printed.
- Email: Opens the default email client on your system with the page's URL as body text. Feel free to edit subject and body text to something more meaningful than the included default text.
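What the Email button does can be approximated with a mailto: link. The helper below is a hypothetical sketch, not the code shipped in the extension; the subject text is whatever the caller passes in.

```javascript
// Hypothetical helper: builds a mailto: link with the page URL as body text.
// Both parts are percent-encoded so that "?", "&" and spaces survive.
function buildMailtoLink(pageUrl, subject) {
  return "mailto:?subject=" + encodeURIComponent(subject) +
         "&body=" + encodeURIComponent(pageUrl);
}
```

Assigning the result to location.href (or to the href of a link) hands it to the default email client, where subject and body can still be edited.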
Preferences | Options
To set up the extension's preferences:
- Right-click the button of the extension
- Choose "Preferences"
- On the preferences page, change the settings in each column at least once and tick or untick the "... footnotes" checkbox. This makes the settings permanent as long as the extension is installed (only necessary after a new install, later you can change each setting individually).
You can see a preview with sample text in the "Example" box below the settings. The settings can be changed again any time later by re-opening the "Preferences".
Supported Languages
cleanPages comes in:
English, French, German, Italian, Polish, Portuguese, Russian, Turkish and Swedish.
The language is set according to your browser language settings and defaults to English for languages not yet supported. The functionality of cleanPages is independent of the page language; one of my test users reported that it works just fine on Japanese pages.
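The fallback to English can be sketched as follows. The two-letter codes and the pickLanguage helper are assumptions made for this example; the extension may store its translations differently.

```javascript
// Language codes assumed for the nine languages named above.
const SUPPORTED_LANGUAGES = ["en", "fr", "de", "it", "pl", "pt", "ru", "tr", "sv"];

// Hypothetical helper: maps a browser language tag to a supported language.
function pickLanguage(browserLanguage) {
  // "de-AT" -> "de"; anything unknown falls back to English.
  const primary = String(browserLanguage || "").toLowerCase().split("-")[0];
  return SUPPORTED_LANGUAGES.indexOf(primary) !== -1 ? primary : "en";
}
```

In a browser the input would usually be navigator.language.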
Please send me a personal message or leave a comment here, if you can and want to translate it into your language.
Changes to the Original Readability™
- Added multilingual preferences and user dialogs.
- Fixed some frame issues. Overwriting or replacing the body of the top document in a frameset is not allowed in Opera because of security restrictions.
- Removed included Typekit fonts. I have no license to use them and I don't intend to buy one.
- Removed original JS smooth scrolling. Opera's built-in scrolling is good enough. Use [space] to scroll down a page and [shift]+[space] to scroll up a page.
- Reactivated the Terminal style.
- Removed the Athelas style.
- Improved the font stacks for cross system use.
- Removed bad browser sniffing because Opera can mask as IE. That wouldn't have worked out.
- Removed or replaced Firefox-only code (read: Firefox-only bug workarounds for not following the W3C specifications).
Various other fixes, see source code of the included script. All changes are marked with /*q ... */
Known Issues
The description is not multilingual. Not my fault, kick Opera for that, especially the person who wrote the parser that checks the config.xml during the publishing process for validity. It doesn't even respect their own specifications.

cleanPages, like the original "Arc90 Readability™" bookmarklet, does not work well with:
- Start pages of a website. Navigate to an article page before you use the extension. I will not change that, my version of cleanPages should stay a small extension with a low system impact. If you think otherwise: feel free to edit it, it is licensed under Apache 2.0
- Pages with not enough text to analyze. Not possible. No way.
- Pages with crappy markup. It will do the best it can.
- Some kinds of frameset pages. However frame pages without forced frame reload should work fine.
- Pages that are reloaded with User Prefs|History Navigation Mode set to "Auto" (1, default) or "Fast" (3). It works better when set to "Compatible" (2).
To switch between all 3 settings you can use this button:
History Navigation Mode
Further known issues: I hope not

cleanPages comes AS IS, meaning:
I won't fix mistakes that other people made on their websites. If it works, it works. If not and if it is my fault, leave a comment below.
If you find any real bugs, please post them in the comments, too.
If it destroys your hard-disc and melts your processor: Buy a new computer

Legal Stuff
"Readability™" is a Trademark of Arc90, http://arc90.com
Permission to use the code was granted by license and email.
Outdated.







QuHno # Wednesday, January 19, 2011 9:34:53 PM
I'll upload the extension with the multilingual config.xml to my own webspace, as soon as the extension gets its approval.
Sorry for the inconvenience.
BjoernDBjörni # Thursday, January 20, 2011 10:40:20 AM
(edit: Added by QuHno: He means his blog)
QuHno # Thursday, January 20, 2011 11:42:12 AM
Only when it is finished
Done
You don't have to, but nobody will stop you if you do
QuHno # Friday, January 21, 2011 11:19:17 PM
I too, but I can't promise anything. I am still testing and looking what I can do, I am no JavaScript Guru, just a qualified copy&paste guy.
owgrunt wrote at the addons page:
At the moment I don't change the original text align at all because it can be very annoying if you look at a page with code examples and they are all justified ...
I'll look into it and if I find a solution that works (meaning: doesn't destroy PRE, CODE etc. and doesn't destroy the image algorithm), maybe I'll add a checkbox for justify, but, like above, I can't promise anything, so don't hold your breath.
metude # Sunday, January 23, 2011 3:23:26 PM
QuHno # Monday, January 24, 2011 2:50:39 PM
Yes, that would be really nice but I don't know if it is possible to read out the internal font list at all. There are some hacks that claim to do that, but they are all quite "expensive" when it comes to computing time and they all work to 80% at most, apart from adding a SWF file, which isn't allowed in an extension AFAIK...
dmitso wrote at the addons page:
Which page?
http://www.smithsonianmag.com/history-archaeology/Bodybuilders-Through-the-Ages.html
works for me, it even pagerizes the article but I have seen the problem at this page:
http://www.smithsonianmag.com/science-nature/Tracking-the-Elusive-Lynx.html
The markup of that page is ... weird and heavily AJAX-loaded, not easy to parse; the original script fails there too.
Those **** webmasters at Smithsonian packed the article content into a container with the class "subNavSponsoredWithPic", which hits 2 stop words at the same time: Nav and Sponsor. Absolutely stupid and un-semantic markup. I'll look into it and if I can find a way to solve that problem without a generic hack for Smithsonian alone, I'll do it.
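To illustrate why the class name "subNavSponsoredWithPic" is a problem, here is a minimal sketch of a stop-word check. The two-word list is illustrative only (the list used by Readability and cleanPages is longer), and stopWordHits is a hypothetical name.

```javascript
// Illustrative stop-word list only; just the two words named in the comment.
const STOP_WORDS = /nav|sponsor/gi;

// Returns the stop words found in a class name, lowercased.
function stopWordHits(className) {
  const hits = className.match(STOP_WORDS) || [];
  return hits.map(function (hit) { return hit.toLowerCase(); });
}
```

For "subNavSponsoredWithPic" both words match, so a scoring algorithm that penalizes each hit would push the article container far down.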
Hint: In the meantime you can use the added functionality that I built in, select the text with the mouse first and press the button then.
I will add a generic hack for that page in the next version, which will come with 3 more translations, too
edit:
Test version 1.0.2-alpha
Changes:
Smithsonianmag.com fix and Italian and Turkish translations added. I am waiting for the 3rd translation to become ready and will submit it if there are no new errors
QuHno # Tuesday, January 25, 2011 5:00:20 PM
Changes:
Polish translation added.
QuHno # Wednesday, January 26, 2011 9:02:10 AM
BTW: There is no bug at the extensions page, there was at least one 4 Star rating because the Smithsonian page doesn't work in the 1.0.1 version (see comment above). Only ratings with comments are shown in the overview, but there are fewer comments than ratings
DavidGPeters wrote at the addons page:
Thanks for the flowers, but Arc90 did most of the work - it would have been impossible for me without their content search algorithm - and I had a lot of help from several community members of the my.opera community
BTW: The Highlighter works just fine without any changes when started as UserJS (see screenshot) - but because my extension is multilingual, I'd need lists of common stop words for the different languages apart from English and German
and make the UTF-8 chartable bigger because cleanPages works with other charsets than iso-latin too (even CJK).
edit: The chartable for exotic characters seems to be quite complete up to the FFxx characters.
It should be possible to integrate it but I have to check the license first and ask the author of the original script under which conditions it is allowed (my Japanese is still bad, despite several years of learning).
I could throw out the Greasemonkey auto updater and the getElementsByClass routine because Opera Extensions can do both by default
DavidGPeters # Wednesday, January 26, 2011 9:46:20 AM
Originally posted by QuHno:
How did you do that...? I followed this tutorial:
http://www.mydigitallife.info/2009/12/11/how-to-install-greasemonkey-user-script-javascript-in-opera/
...but I still can't get the Top Words Highlighter Script
http://top-words-highlighter.googlecode.com/files/top_words_highlighter.user.js
...to work in Opera (Opera Portable, that is).
If I press the keys "Ctrl+Y" in Opera, nothing happens (this shortcut should start the auto keyword highlighting -- at least it does in Firefox).
Btw., I don't think that there would be a license problem with the Top Words Highlighter script. It actually was a (paid) development for me by Pierre Carbonnelle, based on the original Greasemonkey script by hzhbest (who knows about it and showed his approval).
Thanks & Regards
David.P
QuHno # Wednesday, January 26, 2011 10:04:43 AM
A portable install should work just fine - at least mine did - as long as the path is set and can be found.
If you speak German you can look up how to set up such scripts here, or else here.
Some Greasemonkey scripts need additional userJS files added to the userJS folder to emulate special GM_script functions, but this one behaves quite well. It could be possible to store and lock the edited keywords in Opera too, but storage in Opera and everything that belongs to it is something I still have to learn. I am no JS Guru, just an advanced copy&paster and sometimes debugger
edit (2011-01-28): I have reworked the highlighter script so that all functions are available in Opera now. It was a matter of binding the proprietary Greasemonkey functions and methods to the W3C-conformant localStorage calls that Opera supports just fine (like Chrome and other standards-compliant browsers too).
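Binding Greasemonkey storage functions to localStorage can be sketched like this. It is not the reworked highlighter script: createGmShim is a hypothetical wrapper, and storing values as JSON is an assumption of this example. GM_setValue, GM_getValue and GM_deleteValue are the Greasemonkey function names being emulated.

```javascript
// Sketch of a Greasemonkey-style shim on top of a localStorage-like object.
// `storage` only needs getItem, setItem and removeItem.
function createGmShim(storage) {
  return {
    GM_setValue: function (key, value) {
      // localStorage stores strings only, so values are serialized as JSON.
      storage.setItem(key, JSON.stringify(value));
    },
    GM_getValue: function (key, defaultValue) {
      const raw = storage.getItem(key);
      return raw === null ? defaultValue : JSON.parse(raw);
    },
    GM_deleteValue: function (key) {
      storage.removeItem(key);
    }
  };
}
```

In a browser the shim would be created with createGmShim(window.localStorage); passing the storage in as a parameter also makes it easy to try out with a stand-in object.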
DavidGPeters # Wednesday, January 26, 2011 11:03:57 AM
It works perfectly now! For some reason I had the "aagmfunctions.js" script in my Opera script folder, which prevented the Auto Highlighter from working.
After deleting aagmfunctions.js, the Highlighter now works just as well as in Firefox!
Best regards, David.P
Unregistered user # Thursday, January 27, 2011 6:13:43 PM
QuHno # Thursday, January 27, 2011 7:34:48 PM
Explanation:
Extensions can only see the source code or the DOM by design of the Extensions API, so they can only remove or overrule the styles that are directly written or linked in the source code.
Extensions work with injected scripts and stylesheets and just have access to the DOM of the page but to nothing else.
A usercss can overrule everything that a webpage does and everything that an extension does too, because injected JS and CSS are a part of the page.
The priority is as follows (highest to lowest) [1]:
- usercss !important
- page !important (This is what an extension can reach at maximum)
- page normal
- usercss normal
- browser's default css !important
- browser's default css regular
and that is completely independent of selector specificity, so even if the usercss says this:
span { color: lime !important; }
and the page says this:
html body div p span { color: red !important; }
the usercss wins. It's designed to allow a page to style with the simplest of selectors, without the browser's default CSS overriding it. The same goes for the usercss and the pagecss.
In short:
The extension itself works when used in combination with a usercss, but the usercss can always overrule the extension by adding !important to its selector styles.
*1) see also: W3C CSS - Cascading Order
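The priority list in this comment can be written down as a small lookup. The level labels and the cascadeWinner helper are made up for this sketch; it only models origin and importance and deliberately ignores specificity, which matters only within one level.

```javascript
// Cascade levels from the list in this comment, highest priority first.
// The labels are made up for this sketch.
const CASCADE_ORDER = [
  "user-important",
  "page-important",
  "page-normal",
  "user-normal",
  "browser-important",
  "browser-normal"
];

// Position of a declaration in CASCADE_ORDER (lower means stronger).
// `decl.origin` is assumed to be "user", "page" or "browser".
function cascadeRank(decl) {
  const level = decl.origin + (decl.important ? "-important" : "-normal");
  const rank = CASCADE_ORDER.indexOf(level);
  if (rank === -1) throw new Error("unknown cascade level: " + level);
  return rank;
}

// Returns whichever of the two declarations wins the cascade.
function cascadeWinner(a, b) {
  return cascadeRank(a) <= cascadeRank(b) ? a : b;
}
```

With the two rules from the example, the usercss declaration (origin "user", important) beats the page declaration (origin "page", important), no matter how specific the page selector is.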
abbottm # Friday, January 28, 2011 10:29:07 PM
Small bug: I have the "Convert hyperlinks to footnotes" option turned on. Press the cleanPages button while on an article with hyperlinks, then press the button again. All the footnote labels are duplicated.
Example: http://www.badscience.net/2011/01/tell-me-now-how-do-i-feel/
First press: "it was paid for by Sky Travel [1]"
Second press: "it was paid for by Sky Travel [1][1]"
Third press: ... you get the idea
QuHno # Friday, January 28, 2011 10:52:23 PM
Right, I didn't think about pressing the button more than once on the same page with convert to footnotes active. I hope I can catch that: it processes the link again, and because the footnote anchor link from the last click is still there, it is doubled.
Now I have to think about a way to prevent links from being processed again without losing the ability to reorder the footnotes if, for example, a chunk of text is selected before the second click. Not quite trivial ...
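One way to think about the requirement is to make the numbering step idempotent: each run replaces any label from an earlier run instead of appending to it. The sketch below models links as plain objects and is not the fix that went into the extension; numberFootnotes is a hypothetical name.

```javascript
// Sketch only. Each link is modeled as a plain object: { text, href, label }.
function numberFootnotes(links) {
  return links.map(function (link, index) {
    // Overwrite any label from an earlier run instead of appending to it,
    // so repeated runs renumber the links rather than duplicating labels.
    return { text: link.text, href: link.href, label: "[" + (index + 1) + "]" };
  });
}
```

Running it again on a smaller selection simply renumbers the remaining links starting from [1].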
Thank You for reporting it!
edit:
And gone - at least I hope that I didn't break something else with the fix.
The download URL for version 1.0.3-alpha is at the top of the blog post.
If you encounter any problems with the fix, please leave a message here
Unregistered user # Saturday, January 29, 2011 7:06:28 PM
QuHno # Saturday, January 29, 2011 8:31:20 PM
Can't reproduce crashing on Linux, it worked fine on Linux Mint when I tested it ...
Any hints why it is crashing?
Any errors in the JS console?
Special pages?
Unregistered user # Saturday, January 29, 2011 10:01:26 PM
QuHno # Sunday, January 30, 2011 10:24:05 PM
I tested them with Mint too. Several Macs tested it too, but no crashers.
Let's try to find the problem:
Trex 279trex279 # Monday, January 31, 2011 3:21:22 AM
Saskatchewan # Monday, January 31, 2011 3:13:16 PM
QuHno # Monday, January 31, 2011 10:24:57 PM
I know, but that is a difficult problem to solve. It is in the original Readability too and comes from ripping the whole page content into parts and replacing DIVs and TABLEs with P. So all the images the algorithm detects as belonging to the content have to be "rescued" from deletion with the non content stuff and are inserted afterwards. I am still investigating, how I can circumvent that.
BTW: Do you have a URL for me where the effect is particularly disturbing? It is easier with a live example ...
@ Saskatchewan:
Setting fixed widths instead of percentage widths could cause the content to be displayed with scrollbars on smaller screens - but I am thinking about some additional preferences that allow more settings. It may take a while because I am a slow programmer, but maybe I'll add a fixed width setting in one of the next test versions.
Unregistered user # Tuesday, February 1, 2011 2:39:02 PM
NicoHellbillyDeluxe # Wednesday, February 2, 2011 8:38:56 AM
QuHno # Wednesday, February 2, 2011 11:17:21 AM
@ Trex 279:
I have disabled the small images left floating in the Alpha test version linked in the blog post above. Please give it a try.
QuHno # Thursday, February 3, 2011 9:19:39 AM
tt-21 wrote at the addons page:
They are not removed if they are in the same (X)HTML container as the article, but unfortunately that is not always the case. The extension is not intelligent in the way a human is; it just follows the structure of the HTML code, so there have to be some hints in the page source that an image is relevant to the article's content and not only decorative, as e.g. in the navigation or in advertisements.
To the color thing: I plan to change some things in the options page, like the option to define your own colors, font-family, sizes etc., but it may take a while.
Bruno NascimentioBohemiaDrinker # Thursday, February 3, 2011 8:38:22 PM
I just have a very small feature request: could you add a keyboard shortcut, or the ability to customize one?
I don't keep the address bar visible, so I'd like a way to trigger the extension without actually having to click on the button.
QuHno # Friday, February 4, 2011 5:12:13 AM
QuHno # Friday, February 4, 2011 1:01:22 PM
I don't know if I can copy the new UI (it is looking really functional), but it is worth a try. Setting the preferences live is quite nice.
BTW: That the pay service pays gratifications to the authors of the pages that were made readable is a good idea in principle - but be aware that you send the URLs of all pages you made readable to them and that they are gaining many referer backlinks in the server logs.
QuHno # Saturday, February 5, 2011 1:34:17 AM
I didn't ask for being listed there and I didn't want to be listed there and I didn't want my real name listed there and they didn't ask me if it is OK to use my real name there. I get enough SPAM mails and I don't need that extra amount.
Please don't download the extension from there, use the official Opera addons page. All extensions at the official addons page are reviewed by the Opera staff and are guaranteed not to harm your Opera installation, so there is no need at all to download them from foreign sources.
Thank you.
metude # Saturday, February 5, 2011 6:31:47 PM
Just mail isn't enough.
Mintshare (*) could be nice.
QuHno # Sunday, February 6, 2011 1:21:12 PM
I am building another interface for the settings, you can see a raw, unfinished and unstyled preview here. Just hover over the gray bar at the top left and change the view live.
With the planned interface I can add the share buttons easily
BTW: I suck at UI design.
I would really appreciate if someone could make a good looking image mock-up for me so that I can rebuild the interface by that guideline.
Please contact me by PM so that we can talk about details
Unregistered user # Sunday, February 6, 2011 6:54:28 PM
QuHno # Sunday, February 6, 2011 7:41:41 PM
Dragonfly doesn't work with it either (immediate freeze), so it is quite hard to analyze what is wrong with that page ...
edit:
It works if you delete #p2 from the end of the URL.
I think I have a problem with addresses that contain hashes ...
Added to my to-do list, I'll take a look into it.
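The manual workaround (deleting #p2 from the end of the URL) amounts to dropping the fragment part of the address. A hypothetical helper for that could look like this; whether the extension itself should handle hashes this way is a separate question.

```javascript
// Hypothetical helper: returns the URL without its fragment ("#...") part.
function stripHash(url) {
  const hashIndex = url.indexOf("#");
  return hashIndex === -1 ? url : url.slice(0, hashIndex);
}
```

URLs without a hash are returned unchanged.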
TommyTommyAngelo # Monday, February 7, 2011 9:34:59 AM
- font size is sometimes not the same as in the preferences. I use the big font, but it sometimes shows normal; after changing the size to normal and then to big again it works.
- www.spiegel.de doesn't work sometimes
QuHno # Monday, February 7, 2011 10:56:34 AM
That shouldn't happen as long as the extension's storage is not cleared by something - but it could be related to the hash problem I mentioned in the previous comment too (I use the hash in the messaging, but I will change that in the new version).
www.spiegel.de:
I h8 "sometimes" bugs
That is one of my daily visited pages and until now I didn't run into that problem - but it is a huge site with many pages and I don't read every article there. It would be fine if you could provide a direct link to one of the problem pages.
BTW: Does it help if you select content and then click on the button if that happens?
roboperasync # Monday, February 7, 2011 8:55:18 PM
I am the second one who wishes for the possibility of a shortcut. I think it's a big strength of Opera that you can easily use the keyboard for nearly everything. Maybe in some days my favourite extension will work by shortcut too
Thanx for implementing and keep on coding for new features.
Saskatchewan # Tuesday, February 8, 2011 9:11:10 AM
Originally posted by BohemiaDrinker:
Originally posted by roboperasync:
Just wanted to say: You're not alone
QuHno # Wednesday, February 9, 2011 12:50:40 PM
... but it will take a while, it is a major rewrite and I still have some problems to solve, some of them are mentioned here in the comments but there are some more pages that don't run well. I am trying to find a generic solution for that because I can not (and I don't want to) add hacks for each and every site that doesn't play nicely with the extension.
Unregistered user # Wednesday, February 9, 2011 4:26:54 PM
QuHno # Wednesday, February 9, 2011 5:20:44 PM
I am just looking for a good way to build a font selector that shows some of the fonts that are actually installed on the user's system and that can be used for the extension. Unfortunately extensions don't have the same access to the system fonts as they have on the system color picker, so that is a little bit tricky ...
QuHno # Thursday, February 10, 2011 3:56:00 PM
Yes, I know, that you are not alone with that problem, but I can't fix what I can't replicate.
BTW: Can it be a Flash related problem? I just got the information that Flash on Linux hates it to be created and destroyed in short order...
What happens if Flash is deactivated? Does it still crash then?
Unregistered user # Saturday, February 19, 2011 11:46:22 PM
QuHno # Sunday, February 20, 2011 2:42:55 PM
I can't do the same things as www.readability.com does, because the cleaning of the page is done server sided, meaning:
You send the URL of the page to their server, their server grabs the page, rebuilds it and sends the cleaned page back to you. You can test it by saving the page locally, switch Opera to offline mode, open the saved page in Opera and try to use the readability.com bookmarklet. It won't work, whereas the extension does.
Extensions have some limitations when it comes to manipulating the source code of a page because they have to rely on the parsed DOM they get from the browser, and that can be really ugly in Opera sometimes (see the blog post about error correction). That makes it impossible to do some of the really fancy stuff that is easy to do server-side (like running HTML-tidy over the code of a webpage before and after rebuilding it).
To rebuild the page I have to destroy the page source before correcting the errors. Unfortunately the error correction of Opera kicks in at these moments and prevents consistent results, so I can't do everything I want to do ...
At the moment I am trying to correct that in my internal version, but to do this I have to write a rudimentary parser that works independently of Opera's parser. That is no easy task and I don't know if I will succeed in doing so, so don't hold your breath.
Unregistered user # Monday, February 21, 2011 1:08:59 AM
QuHno # Monday, February 21, 2011 1:56:21 PM
The problem is not only the tag table (OK, that too. Any help to improve it is welcome), but that some elements cannot be selected with DOM methods after the initial start of the cleaning process, because the error correction of the DOM parser corrects partially rewritten code before the script can do so (as you can see here; "Result" is the same page reloaded as txt and "pre-cleaned" with some regex). It is some kind of race condition: the faster one wins, and in 99% of all cases it is the parser.
This problem doesn't occur in the server-side closed version and in Safari, because there the scripts have direct access to the unchanged page code.
Bruce Couperzeno53 # Saturday, February 26, 2011 2:48:21 AM
So, thank you! I've spent much time away from Opera but the inclusion of extensions, the speed and the usual reasons draw me back. A few seconds of looking and I found your extension. Exactly what I needed! I wish Arc90 well with their endeavor but for a few reasons I no longer use it. Your alternative is excellent.
DitherDitherSky # Tuesday, March 15, 2011 8:37:42 PM
QuHno # Wednesday, March 16, 2011 12:16:31 PM
I am working on a major UI overhaul and I hope I can include that. I was thinking about a popup, but messaging between injected script, popup and background is quite ugly when you want to change settings from places other than the options page; I could use some help with that.
... but I have to solve some of the more "core" related bugs like the URL with # problem too, before I can finalize the new ideas to a test version.
QuHno # Thursday, March 17, 2011 9:46:06 AM
Please uninstall the old version before installing the experimental version.