text-mining

It would be great to have friendlier and funnier text content in the doctests (instead of "Aha", "Text", ...). It's also nicer for users if the docstring examples all look similar.

One idea, for instance, is to use famous lines spoken by movie superheroes. Here are a few examples:

I have the power!
Flame on!
HULK SMASH!
Holy ____ Batman!
I am the vengeance, I am the night, I am BATMAN!
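
As a rough illustration of the idea, a docstring could reuse these quotes as its doctest input. This is only a minimal sketch: `remove_punctuation` is a hypothetical helper written for this example, not a function taken from the library under discussion, and it only handles ASCII punctuation.

```python
import doctest
import string


def remove_punctuation(text: str) -> str:
    """Strip ASCII punctuation from a string.

    Hypothetical helper for illustration only; the doctests below use
    superhero quotes instead of placeholder strings such as "Aha" or "Text".

    >>> remove_punctuation("HULK SMASH!")
    'HULK SMASH'
    >>> remove_punctuation("I am the vengeance, I am the night, I am BATMAN!")
    'I am the vengeance I am the night I am BATMAN'
    """
    # Build a translation table that deletes every ASCII punctuation character.
    return text.translate(str.maketrans("", "", string.punctuation))


if __name__ == "__main__":
    # Run the examples embedded in the docstring above; silent when they pass.
    doctest.testmod()
```

Using the same small pool of quotes across all docstrings would keep the examples consistent from one function to the next.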

I have mostly tested trafilatura on a set of English, German and French web pages that I came across while browsing or during web crawls. There are certainly other web pages, and cases in other languages, for which the extraction does not work yet.

Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in [xpaths.py](https://github.com

Here are 1,695 public repositories matching this topic...

keon / awesome-nlp

deanmalmgren / textract

jbesomi / texthero

Matching Content in our Doctests

Add version attribute

My wordcloud looks ugly. Which argument to change to make it look cleaner?

chiphuyen / lazynlp

JasonKessler / scattertext

ujjwalkarn / DataScienceR

mathsyouth / awesome-text-summarization

konlpy / konlpy

dgrtwo / tidy-text-mining

juliasilge / tidytext

shangjingbo1226 / AutoPhrase

gsh199449 / spider

kavgan / nlp-in-practice

csurfer / rake-nltk

dselivanov / text2vec

nlptown / nlp-notebooks

DemonDamon / Listed-company-news-crawl-and-text-analysis

bigartm / bigartm

opensemanticsearch / open-semantic-search

cpsievert / LDAvis

laugustyniak / awesome-sentiment-analysis

stepthom / text_mining_resources

stephenhky / PyShortTextCategorization

adbar / trafilatura

List of smaller extraction bugs (text & metadata)

graphbrain / graphbrain

kk7nc / RMDL

airbnb / artificial-adversary

jmartinezheras / 2018-MachineLearning-Lectures-ESA

ropensci / rplos

lining0806 / TextMining
