Html To Readable Text
I'm writing a program to run on Google App Engine. Which simply get an URL and return the text by removing markups, scripts and any other non-readable things from its HTML source(s
Solution 1:
That's because there's a mix between unicode and bytestrings.
If you use in the module with HtmlTool
from __future__ import unicode_literals
To ensure every " "
-like blocks are unicode, and
text = HtmlTool.to_nice_text(urllib.urlopen(url).read().decode("utf-8"))
To send a UTF-8 string to your method, that solves your problem.
Please read this for more information about Unicode: http://www.joelonsoftware.com/articles/Unicode.html
Post a Comment for "Html To Readable Text"