How to tokenize a string in Python
String tokenization is the process of breaking a string into smaller parts; each part is called a token.
When working with data, we often need to tokenize the strings we receive as input. Tokenization is a common preprocessing step in many machine learning applications, such as text classification and natural language processing.
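Before tokenizing a whole list, it helps to see what str.split() does to a single string. A minimal sketch (the sample string is illustrative): called with no arguments, split() breaks on any run of whitespace and drops leading and trailing whitespace, so no empty tokens are produced.

```python
# split() with no arguments splits on any run of whitespace
# and ignores leading/trailing whitespace, so there are no empty tokens
text = '  Python is easy  '
tokens = text.split()
print(tokens)  # ['Python', 'is', 'easy']
```

This is why the leading space in a string like ' powerful' disappears after tokenization in the methods below.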
Method 1: Using list comprehension + split()
sample = ['Python is easy', ' powerful', 'programming language']
print('Original list: ' + str(sample))

# tokenizing each string in the list
out = [i.split() for i in sample]
print('After tokenizing: ' + str(out))
Output:
Original list: ['Python is easy', ' powerful', 'programming language']
After tokenizing: [['Python', 'is', 'easy'], ['powerful'], ['programming', 'language']]
Method 2: Using map() + split()
sample = ['Python is easy', ' powerful', 'programming language']
print('Original list: ' + str(sample))

# tokenizing each string by mapping str.split over the list
out = list(map(str.split, sample))
print('After tokenizing: ' + str(out))
Output:
Original list: ['Python is easy', ' powerful', 'programming language']
After tokenizing: [['Python', 'is', 'easy'], ['powerful'], ['programming', 'language']]
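Both methods above split on whitespace, but split() also accepts an explicit separator, so the same pattern works for other token boundaries. A short sketch (the comma-separated sample list is illustrative):

```python
# split() with an explicit separator tokenizes on that character instead
# of whitespace; here each string is split on commas
sample = ['a,b,c', 'd,e']
out = [s.split(',') for s in sample]
print(out)  # [['a', 'b', 'c'], ['d', 'e']]
```

Note that with an explicit separator, consecutive separators produce empty tokens (e.g. 'a,,b'.split(',') gives ['a', '', 'b']), unlike the whitespace-splitting behaviour used above.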