
Python Text Statistics and Analysis: From Basics to Advanced

2024-05-06 12:00 https://my.oschina.net/u/4526289/blog/11090410 Huawei Cloud Developer Alliance

This article is shared from the Huawei Cloud community post "Python Text Statistics and Analysis: From Basics to Advanced", by 柠檬味拥抱.

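Text data is everywhere in the digital age, from social media posts to news articles to academic papers, and it carries a great deal of information. Statistical analysis is a common requirement when working with such data, and Python, a powerful and easy-to-learn language, offers a rich set of tools and libraries for it. This article shows how to use Python to compute statistics on English text, including word frequency counting, vocabulary size statistics, and text sentiment analysis.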

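Word Frequency Counting

Word frequency counting is one of the most basic tasks in text analysis. Python offers many ways to do it; the following is one basic approach: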

def count_words(text):
    # Convert the text to lowercase and replace punctuation with spaces
    text = text.lower()
    for char in '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~':
        text = text.replace(char, ' ')

    # Split the text into a list of words
    words = text.split()

    # Create an empty dictionary to hold the word counts
    word_count = {}

    # Walk through the words and update each count in the dictionary
    for word in words:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1

    return word_count

# Test code
if __name__ == "__main__":
    text = "This is a sample text. We will use this text to count the occurrences of each word."
    word_count = count_words(text)
    for word, count in word_count.items():
        print(f"{word}: {count}")

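This code defines a function count_words(text) that takes a text string and returns a dictionary mapping each word in the text to the number of times it occurs. A line-by-line walkthrough follows:

def count_words(text): defines a function named count_words that takes one parameter, text, the string to process.

text = text.lower(): converts the string to lowercase so that counting is case-insensitive.

for char in '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~': loops over each character of a string literal that lists the ASCII punctuation characters (not over the text itself).

text = text.replace(char, ' '): replaces every occurrence of the current punctuation character in the text with a space, so punctuation no longer appears in the text and acts as a word separator instead. Apostrophes and hyphens are included, so "don't" becomes the two tokens "don" and "t".

words = text.split(): splits the cleaned string into a list of words. Called with no argument, split() breaks on any run of whitespace, so consecutive spaces do not produce empty strings.

word_count = {}: creates an empty dictionary for the counts; each key is a word and each value is the number of times that word occurs in the text.

for word in words: iterates over every word in the list.

if word in word_count: checks whether the current word is already a key in the dictionary.

word_count[word] += 1: if the word is already present, increments its count by 1.

else: handles the case where the word is not yet in the dictionary.

word_count[word] = 1: adds the new word to the dictionary with a count of 1.

return word_count: returns the dictionary of word counts.

if __name__ == "__main__": checks whether the script is being run as the main program rather than imported.

text = "This is a sample text. We will use this text to count the occurrences of each word.": defines a test string.

word_count = count_words(text): calls count_words with the test string and stores the result in the variable word_count.

for word, count in word_count.items(): iterates over every key-value pair in the word_count dictionary.

print(f"{word}: {count}"): prints each word followed by its count.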

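Running the script prints each word with its count, in order of first appearance (dictionaries preserve insertion order in Python 3.7 and later):

this: 2
is: 1
a: 1
sample: 1
text: 2
we: 1
will: 1
use: 1
to: 1
count: 1
the: 1
occurrences: 1
of: 1
each: 1
word: 1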
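The punctuation loop and the if/else counting walked through above can be written more compactly. The following is only a sketch, not the author's code: the name count_words_translate is hypothetical, and it assumes that string.punctuation (which contains the same ASCII punctuation characters as the literal in count_words) together with str.translate is an acceptable substitute for the character-by-character replace.

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
# string.punctuation holds the same characters as the literal in count_words.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def count_words_translate(text):
    """Hypothetical variant of count_words that produces the same counts."""
    # Lowercase first, then blank out all punctuation in a single pass.
    cleaned = text.lower().translate(_PUNCT_TO_SPACE)

    word_count = {}
    for word in cleaned.split():
        # dict.get returns 0 for unseen words, which replaces the if/else branch.
        word_count[word] = word_count.get(word, 0) + 1
    return word_count


if __name__ == "__main__":
    sample = "Don't over-think it: it's a well-known trick."
    for word, count in count_words_translate(sample).items():
        print(f"{word}: {count}")
```

Because apostrophes and hyphens are treated as separators here, just as in count_words, the sample sentence yields tokens such as "don", "t", "well" and "known" rather than "don't" or "well-known".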

Further Optimization and Extension

import re
from collections import Counter


def count_words(text):
    # Use a regular expression to extract the words (hyphenated words are kept whole)
    words = re.findall(r'\b\w+(?:-\w+)*\b', text.lower())

    # Use Counter to tally the occurrences of each word
    word_count = Counter(words)

    return word_count


# Test code
if __name__ == "__main__":
    text = "This is a sample text. We will use this text to count the occurrences of each word."
    word_count = count_words(text)
    for word, count in word_count.items():
        print(f"{word}: {count}")
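As a small illustration of what the regular expression and Counter in count_words make possible, the sketch below reuses the same pattern and asks Counter for the most frequent words. The helper name top_words, the constant WORD_PATTERN, and the sample sentence are assumptions made for this example and do not appear in the original article.

```python
import re
from collections import Counter

# Same pattern as in count_words: runs of word characters, optionally joined by hyphens.
WORD_PATTERN = re.compile(r"\b\w+(?:-\w+)*\b")


def top_words(text, n=3):
    """Hypothetical helper: return the n most frequent words as (word, count) pairs."""
    words = WORD_PATTERN.findall(text.lower())
    # Counter.most_common sorts by count, highest first.
    return Counter(words).most_common(n)


if __name__ == "__main__":
    sample = "The cat saw the well-known cat; the end."
    for word, count in top_words(sample, 2):
        print(f"{word}: {count}")
```

With this pattern "well-known" is counted as a single word. Apostrophes are still not matched by \w, so a contraction such as "don't" is split into "don" and "t".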

Compared with the first example, the regex-and-Counter version of count_words differs in the following ways: