#keywording | Explore Tumblr Posts and Blogs

pythonprogrammingsnippets · 1 year


python keyword extraction using nltk wordnet

import re

# include wordnet.morphy
# (the WordNet corpus must be installed first, e.g. via nltk.download("wordnet"))
from nltk.corpus import wordnet

# https://pythonprogrammingsnippets.tumblr.com/


def get_non_plural(word):
    # return the non-plural form of a word
    # if word is not empty
    if word != "":
        # get the non-plural form
        non_plural = wordnet.morphy(word, wordnet.NOUN)
        # if non_plural is not empty
        if non_plural is not None:
            # return the non-plural form
            # print(word, "->", non_plural)
            return non_plural
    # if word is empty or non_plural is empty
    return word


def get_root_word(word):
    # return the root word of a word
    # if word is not empty
    if word != "":
        word = get_non_plural(word)
        # get the root word
        root_word = wordnet.morphy(word)
        # if root_word is not empty
        if root_word is not None:
            # return the root word
            # print(word, "->", root_word)
            word = root_word
    # if word is empty or root_word is empty
    return word


def process_keywords(keywords):
    ret_k = []
    for k in keywords:
        # replace all characters that are not letters, spaces, or apostrophes with a space
        k = re.sub(r"[^a-zA-Z' ]", " ", k)
        # if there is more than one whitespace in a row, replace it
        # with a single whitespace
        k = re.sub(r"\s+", " ", k)
        # remove leading and trailing whitespace
        k = k.strip()
        k = k.lower()
        # if k has more than one word, split it into words and add each word
        # back to keywords
        if " " in k:
            ret_k.append(k)  # we still want the original keyword
            k = k.split(" ")
            for k2 in k:
                # if not is_adjective(k2):
                ret_k.append(get_root_word(k2))
                ret_k.append(k2.strip())
        else:
            # if not is_adjective(k):
            ret_k.append(get_root_word(k))
            ret_k.append(k.strip())
    # unique (note: a set does not preserve order)
    ret_k = list(set(ret_k))
    # remove empty strings
    ret_k = [k for k in ret_k if k != ""]
    # remove all words that are less than 3 characters
    ret_k = [k for k in ret_k if len(k) >= 3]
    # remove words like 'and', 'or', 'the', etc.
    stop_words = [
        "and", "or", "the", "a", "an", "of", "to", "in", "on", "at", "for",
        "with", "from", "by", "as", "into", "like", "through", "after", "over",
        "between", "out", "against", "during", "without", "before", "under",
        "around", "among", "throughout", "despite", "towards", "upon",
        "concerning", "this", "that", "these", "those", "is", "are", "was",
        "were", "be", "been", "being", "have", "has", "had", "having", "do",
        "does", "did", "doing", "will", "would", "shall", "should", "can",
        "could", "may", "might", "must", "ought", "i", "me", "my", "mine",
        "we", "us", "our", "ours", "you", "your", "yours", "he", "him", "his",
        "she", "her", "hers", "it", "its", "they", "them", "their", "theirs",
        "what", "which", "who", "whom", "whose", "myself", "yourself",
        "himself", "herself", "itself", "ourselves", "yourselves",
        "themselves", "whoever", "whatever", "whomever", "whichever",
    ]
    ret_k = [k for k in ret_k if k not in stop_words]
    return ret_k


def extract_keywords(paragraph):
    if " " in paragraph:
        return paragraph.split(" ")
    return [paragraph]

example usage:

the_string = "Jims House of Judo and Karate is a martial arts school in the heart of downtown San Francisco. We offer classes in Judo, Karate, and Jiu Jitsu. We also offer private lessons and group classes. We have a great staff of instructors who are all black belts. We have been in business for over 20 years. We are located at 123 Main Street."
keywords = process_keywords(extract_keywords(the_string))
print(keywords)

output:

# output: ['jims', 'instructors', 'class', 'lesson', 'all', 'school', 'san', 'martial', 'classes', 'karate', 'great', 'lessons', 'downtown', 'private', 'arts', 'also', 'locate', 'belts', 'business', 'judo', 'years', 'located', 'main', 'street', 'jitsu', 'house', 'offer', 'staff', 'group', 'heart', 'instructor', 'belt', 'black', 'francisco', 'jiu']
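
The list above is built with list(set(...)), so its ordering is arbitrary and can change between runs. Here is a minimal sketch, assuming you would rather keep keywords in first-seen order: dict.fromkeys drops duplicates while preserving insertion order, and a frozenset keeps the stop-word lookup fast. The names STOP_WORDS (a shortened stand-in for the full list) and dedupe_keywords are hypothetical and not part of the original snippet.

```python
# Hypothetical helper, not part of the original post.
# STOP_WORDS is a shortened stand-in for the full list used in process_keywords.
STOP_WORDS = frozenset({"and", "or", "the", "a", "an", "of", "in", "is"})


def dedupe_keywords(words, min_length=3):
    """Drop duplicates, short words and stop words, keeping first-seen order."""
    unique = dict.fromkeys(words)  # dicts preserve insertion order (Python 3.7+)
    return [w for w in unique if len(w) >= min_length and w not in STOP_WORDS]


print(dedupe_keywords(["judo", "and", "karate", "judo", "is", "class", "karate"]))
# prints ['judo', 'karate', 'class']
```

Swapping this in for the set-based step would make the printed keyword list reproducible, at the cost of one extra pass over the words.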


mostlysignssomeportents · 13 days


Google is (still) losing the spam wars to zombie news-brands

I'm touring my new, nationally bestselling novel The Bezzle! Catch me TONIGHT (May 3) in CALGARY, then TOMORROW (May 4) in VANCOUVER, then onto Tartu, Estonia, and beyond!

Even Google admits – grudgingly – that it is losing the spam wars. The explosive proliferation of botshit has supercharged the sleazy "search engine optimization" business, such that results to common queries are 50% Google ads to spam sites, and 50% links to spam sites that tricked Google into a high rank (without paying for an ad):

https://developers.google.com/search/blog/2024/03/core-update-spam-policies#site-reputation

It's nice that Google has finally stopped gaslighting the rest of us with claims that its search was still the same bedrock utility that so many of us relied upon as a key piece of internet infrastructure. That claim not only feels wildly wrong, it is empirically, provably false:

https://downloads.webis.de/publications/papers/bevendorff_2024a.pdf

Not only that, but we know why Google search sucks. Memos released as part of the DOJ's antitrust case against Google reveal that the company deliberately chose to worsen search quality to increase the number of queries you'd have to make (and the number of ads you'd have to see) to find a decent result:

https://pluralistic.net/2024/04/24/naming-names/#prabhakar-raghavan

Google's antitrust case turns on the idea that the company bought its way to dominance, spending some of the billions it extracted from advertisers and publishers to buy the default position on every platform, so that no one ever tried another search engine, which meant that no one would invest in another search engine, either.

Google's tacit defense is that its monopoly billions only incidentally fund these kinds of anticompetitive deals. Mostly, Google says, it uses its billions to build the greatest search engine, ad platform, mobile OS, etc. that the public could dream of. Only a company as big as Google (says Google) can afford to fund the R&D and security to keep its platform useful for the rest of us.

That's the "monopolistic bargain" – let the monopolist become a dictator, and they will be a benevolent dictator. Shriven of "wasteful competition," the monopolist can split their profits with the public by funding public goods and the public interest.

Google has clearly reneged on that bargain. A company experiencing dramatic security failures and declining quality should be pouring everything it has into righting the ship. Instead, Google repeatedly blew tens of billions of dollars on stock buybacks while doing mass layoffs:

https://pluralistic.net/2024/02/21/im-feeling-unlucky/#not-up-to-the-task

Those layoffs have now reached the company's "core" teams, even as its core services continue to decay:

https://qz.com/google-is-laying-off-hundreds-as-it-moves-core-jobs-abr-1851449528

(Google's antitrust trial was shrouded in secrecy, thanks to the judge's deference to the company's insistence on confidentiality. The case is moving along though, and warrants your continued attention:)

https://www.thebignewsletter.com/p/the-2-trillion-secret-trial-against

Google wormed its way into so many corners of our lives that its enshittification keeps erupting in odd places, like ordering takeout food:

https://pluralistic.net/2023/02/24/passive-income/#swiss-cheese-security

Back in February, Housefresh – a rigorous review site for home air purifiers – published a viral, damning account of how Google had allowed itself to be overrun by spammers who purport to provide reviews of air purifiers, but who do little to no testing and often employ AI chatbots to write automated garbage:

https://housefresh.com/david-vs-digital-goliaths/

In the months since, Housefresh's Gisele Navarro has continued to fight for the survival of her high-quality air purifier review site, and has received many tips from insiders at the spam-farms and Google, all of which she recounts in a followup essay:

https://housefresh.com/how-google-decimated-housefresh/

One of the worst offenders in the spam wars is Dotdash Meredith, a content-farm that "publishes" multiple websites that recycle parts of each other's content in order to climb to the top search slots for lucrative product review spots, which can be monetized via affiliate links.

A Dotdash Meredith insider told Navarro that the company uses a tactic called "keyword swarming" to push high-quality independent sites off the top of Google and replace them with its own garbage reviews. When Dotdash Meredith finds an independent site that occupies the top results for a lucrative Google result, they "swarm a smaller site’s foothold on one or two articles by essentially publishing 10 articles [on the topic] and beefing up [Dotdash Meredith sites’] authority."

Dotdash Meredith has keyword swarmed a large number of topics, from air purifiers to slow cookers to posture correctors for back-pain:

https://housefresh.com/wp-content/uploads/2024/05/keyword-swarming-dotdash.jpg

The company isn't shy about this. Its own shareholder communications boast about it. What's more, it has competition.

Take Forbes, an actual news-site, which has a whole shadow-empire of web-pages reviewing products for puppies, dogs, kittens and cats, all of which link to high affiliate-fee-generating pet insurance products. These reviews are not good, but they are treasured by Google's algorithm, which views them as a part of Forbes's legitimate news-publishing operation and lets them draft on Forbes's authority.

This side-hustle for Forbes comes at a cost for the rest of us, though. The reviewers who actually put in the hard work to figure out which pet products are worth your money (and which ones are bad, defective or dangerous) are crowded off the front page of Google and eventually disappear, leaving behind nothing but semi-automated SEO garbage from Forbes:

https://twitter.com/ichbinGisele/status/1642481590524583936

There's a name for this: "site reputation abuse." That's when a site perverts its current – or past – practice of publishing high-quality materials to trick Google into giving the site a high ranking. Think of how Deadspin's private equity grifter owners turned it into a site full of casino affiliate spam:

https://www.404media.co/who-owns-deadspin-now-lineup-publishing/

The same thing happened to the venerable Money magazine:

https://moneygroup.pr/

Money is one of the many sites whose air purifier reviews Google gives preference to, despite the fact that they do no testing. According to Google, Money is also a reliable source of information on reprogramming your garage-door opener, buying a paint-sprayer, etc:

https://money.com/best-paint-sprayer/

All of this is made ten million times worse by AI, which can spray out superficially plausible botshit in superhuman quantities, letting spammers produce thousands of variations on their shitty reviews, flooding the zone with bullshit in classic Steve Bannon style:

https://escapecollective.com/commerce-content-is-breaking-product-reviews/

As Gizmodo, Sports Illustrated and USA Today have learned the hard way, AI can't write factual news pieces. But it can pump out bullshit written for the express purpose of drafting on the good work human journalists have done and tricking Google – the search engine 90% of us rely on – into upranking bullshit at the expense of high-quality information.

A variety of AI service bureaux have popped up to provide AI botshit as a service to news brands. While Navarro doesn't say so, I'm willing to bet that for news bosses, outsourcing your botshit scams to a third party is considered an excellent way of avoiding your journalists' wrath. The biggest botshit-as-a-service company is ASR Group (which also uses the alias Advon Commerce).

Advon claims that its botshit is, in fact, written by humans. But Advon's employees' Linkedin profiles tell a different story, boasting of their mastery of AI tools in the industrial-scale production of botshit:

https://housefresh.com/wp-content/uploads/2024/05/Advon-AI-LinkedIn.jpg

Now, none of this is particularly sophisticated. It doesn't take much discernment to spot when a site is engaged in "site reputation abuse." Presumably, the 12,000 googlers the company fired last year could have been employed to check the top review keyword results manually every couple of days and permaban any site caught cheating this way.

Instead, Google has announced a change in policy: starting May 5, the company will downrank any site caught engaged in site reputation abuse. However, the company takes a very narrow view of site reputation abuse, limiting punishments to sites that employ third parties to generate or uprank their botshit. Companies that produce their botshit in-house are seemingly not covered by this policy.

As Navarro writes, some sites – like Forbes – have prepared for May 5 by blocking their botshit sections from Google's crawler. This can't be their permanent strategy, though – either they'll have to kill the section or bring it in-house to comply with Google's rules. Bringing things in-house isn't that hard: US News and World Report is advertising for an SEO editor who will publish 70-80 posts per month, doubtless each one a masterpiece of high-quality, carefully researched material of great value to Google's users:

https://twitter.com/dannyashton/status/1777408051357585425
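
For readers wondering what "blocking a section from Google's crawler" looks like in practice: one common mechanism is a robots.txt rule, though nothing here says which mechanism any particular publisher used. The sketch below uses Python's standard urllib.robotparser against a made-up robots.txt for an imaginary site; the file contents, paths and example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site; not any real publisher's file.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /reviews/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot is told to stay out of /reviews/ but may still crawl the news section.
print(parser.can_fetch("Googlebot", "https://example.com/reviews/best-widgets"))  # prints False
print(parser.can_fetch("Googlebot", "https://example.com/news/story"))  # prints True
```

A rule like this hides a section from one crawler while leaving the rest of the site indexable, which is why it works as a stopgap but not as a long-term answer to the policy.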

As Navarro points out, Google is palpably reluctant to target the largest, best-funded spammers. Its March 2024 update kicked many garbage AI sites out of the index – but only small bottom-feeders, not large, once-respected publications that have been colonized by private equity spam-farmers.

All of this comes at a price, and it's only incidentally paid by legitimate sites like Housefresh. The real price is borne by all of us, who are funneled by the 90%-market-share search engine into "review" sites that push low-quality, high-price products. Housefresh's top budget air purifier costs $79. That's hundreds of dollars cheaper than the "budget" pick at other sites, who largely perform no original research.

Google search has a problem. AI botshit is dominating Google's search results, and it's not just in product reviews. Searches for infrastructure code samples are dominated by botshit code generated by Pulumi AI, whose chatbot hallucinates nonexistent AWS features:

https://www.theregister.com/2024/05/01/pulumi_ai_pollution_of_search/

This is hugely consequential: when these "hallucinations" slip through into production code, they create huge vulnerabilities for widespread malicious exploitation:

https://www.theregister.com/2024/03/28/ai_bots_hallucinate_software_packages/

We've put all our eggs in Google's basket, and Google's dropped the basket – but it doesn't matter because they can spend $20b/year bribing Apple to make sure no one ever tries a rival search engine on iOS or Safari:

https://finance.yahoo.com/news/google-payments-apple-reached-20-220947331.html

Google's response – laying off core developers, outsourcing to low-waged territories with weak labor protections and spending billions on stock buybacks – presents a picture of a company that is too big to care:

https://pluralistic.net/2024/04/04/teach-me-how-to-shruggie/#kagi

Google promised us a quid-pro-quo: let them be the single, authoritative portal ("organize the world’s information and make it universally accessible and useful"), and they will earn that spot by being the best search there is:

https://www.ft.com/content/b9eb3180-2a6e-41eb-91fe-2ab5942d4150

But – like the spammers at the top of its search result pages – Google didn't earn its spot at the center of our digital lives.

It cheated.

If you'd like an essay-formatted version of this post to read or share, here's a link to it on pluralistic.net, my surveillance-free, ad-free, tracker-free blog:

https://pluralistic.net/2024/05/03/keyword-swarming/#site-reputation-abuse

Image: freezelight (modified) https://commons.wikimedia.org/wiki/File:Spam_wall_-_Flickr_-_freezelight.jpg

CC BY-SA 2.0 https://creativecommons.org/licenses/by-sa/2.0/deed.en