0.1. Index
https://waterflow.link/articles/1666449874974
1. String encoding
In Go, a rune is a Unicode code point.
UTF-8 encodes each code point as 1 to 4 bytes; commonly used Chinese characters, for example, are encoded as 3 bytes. A rune has to be able to hold any code point, so it is defined as an alias for int32:
type rune = int32
For the English string hello, len(s) returns 5, because each English letter is encoded as 1 byte:
    package main

    import "fmt"

    func main() {
        s := "hello"
        fmt.Println(len(s)) // 5
    }
But for the single Chinese character 嗨, len(s) returns 3, because UTF-8 encodes this character as 3 bytes:
    package main

    import "fmt"

    func main() {
        s := "嗨"
        fmt.Println(len(s)) // 3
    }
So the len built-in function returns the number of bytes in a string, not the number of characters.
Let's look at an interesting example. The Chinese character 嗨 is encoded as the 3 bytes 0xE5, 0x97, and 0xA8, so running the following code prints 嗨:
    package main

    import "fmt"

    func main() {
        s := string([]byte{0xE5, 0x97, 0xA8})
        fmt.Println(s) // 嗨
    }
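Going the other way, the bytes behind a string can be inspected directly. This is a minimal sketch; the helper name hexBytes is hypothetical and only wraps the fmt verb "% x", which prints each byte of a string in hexadecimal:

```go
package main

import "fmt"

// hexBytes formats the bytes of s as space-separated hexadecimal values.
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	fmt.Println(hexBytes("hello")) // 68 65 6c 6c 6f
	fmt.Println(hexBytes("嗨"))    // e5 97 a8
}
```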
So we need to know:
- A character set is a set of characters, and an encoding describes how to convert the character set to binary
- In Go, a string is an immutable sequence of arbitrary bytes
- Go source code is UTF-8 encoded, so string literals (without byte-level escapes such as \xNN) are UTF-8 strings. But because a string can contain arbitrary bytes, a string obtained from somewhere else (not from source code) is not guaranteed to be valid UTF-8
- Using UTF-8, a Unicode character is encoded as 1 to 4 bytes
- Using len on a string in Go returns the number of bytes, not the number of characters
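The points above can be checked with the unicode/utf8 package. The following is a small sketch, not a definitive tool; the helper name describe is hypothetical, and the escapes \u00e9 and \U0001F600 are used only to make the 2-byte and 4-byte cases explicit:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports the byte length, the rune count, and the UTF-8 validity of s.
func describe(s string) (byteLen, runeCount int, valid bool) {
	return len(s), utf8.RuneCountInString(s), utf8.ValidString(s)
}

func main() {
	// One character each, encoded as 1, 2, 3, and 4 bytes respectively.
	for _, s := range []string{"a", "\u00e9", "嗨", "\U0001F600"} {
		b, r, ok := describe(s)
		fmt.Printf("%q bytes=%d runes=%d valid=%v\n", s, b, r, ok)
	}

	// A string built from arbitrary bytes need not be valid UTF-8:
	// here the last byte of 嗨 is cut off.
	b, r, ok := describe(string([]byte{0xE5, 0x97}))
	fmt.Printf("truncated: bytes=%d runes=%d valid=%v\n", b, r, ok)
}
```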
2. String Traversal
Traversing a string is a common operation in development. We may want to perform an action on each rune in the string, or implement a custom function that searches for a specific substring. In both cases we have to iterate over the characters of the string, and the results are often unexpected.
Consider the following example, which prints each character of a string together with its position:
package main import "fmt" func main() { s := "h Hi llo" for i := range s { fmt.Printf("character position %d: %c\n", i, s[i]) } fmt.Printf("len=%d\n", len(s)) }
go run 7.go character position 0: h character position 1: å character position 4: l character position 5: l character position 6: o len=7
What we wanted was to print each character together with its index. Instead of 嗨 we get the special character å, because s[i] is a single byte: s[1] is 0xE5, the first byte of 嗨, and %c prints it as the code point U+00E5 (å).
The byte count, however, is in line with our expectations: 嗨 is a Chinese character that occupies 3 bytes, so len returns 7.
3. The number of characters in the string
If we want to get the correct number of characters in a string, we can use the utf8 package in go:
package main import ( "fmt" "unicode/utf8" ) func main() { s := "h Hi llo" for i := range s { fmt.Printf("character position %d: %c\n", i, s[i]) } fmt.Printf("len=%d\n", len(s)) fmt.Printf(" rune len=%d\n", utf8.RuneCountInString(s)) // Get the number of characters }
go run 7.go character position 0: h character position 1: å character position 4: l character position 5: l character position 6: o len=7 rune len=5
In this example the loop runs 5 times, once for each of the 5 characters in the string. But the index i is the starting byte position of each character, and s[i] is only the single byte at that position.
So how do we print the correct result? Let's modify the code slightly:
package main import ( "fmt" "unicode/utf8" ) func main() { s := "h Hi llo" for i, v := range s { // Change to get v here, you can get the character itself fmt.Printf("character position %d: %c\n", i, v) } fmt.Printf("len=%d\n", len(s)) fmt.Printf(" rune len=%d\n", utf8.RuneCountInString(s)) }
go run 7.go character position 0: h character position 1: Hi character position 4: l character position 5: l character position 6: o len=7 rune len=5
Another way is to convert the string to rune slice, which will also print the result correctly:
package main import ( "fmt" "unicode/utf8" ) func main() { s := "h Hi llo" b := []rune(s) for i := range b { fmt.Printf("character position %d: %c\n", i, b[i]) } fmt.Printf("len=%d\n", len(s)) fmt.Printf(" rune len=%d\n", utf8.RuneCountInString(s)) }
go run 7.go character position 0: h character position 1: Hi character position 2: l character position 3: l character position 4: o len=7 rune len=5
Note that converting a string to a rune slice is not free: the conversion has to traverse all the bytes of the string, so its complexity is O(n).
4. String trim
In development, we often need to remove characters from the head or tail of a string. For example, given the string xohelloxo, suppose we want to remove the xo at the end. We might write it like this:
    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        s := "xohelloxo"
        s = strings.TrimRight(s, "xo")
        fmt.Println(s)
    }

    go run 7.go
    xohell
As you can see, this is not what we expected. The second argument of TrimRight is a set of characters, not a suffix. TrimRight works like this:
- Take the last character of the string and check whether it is in the set xo; if it is, remove it
- Repeat step 1 until the last character is not in the set
This explains the result: the trailing o and x are removed, and so is the o at the end of hello. TrimLeft and Trim work on the same principle.
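The two steps above can be sketched as a small loop. This is an illustration of the described behavior, not the standard library's actual implementation; the name trimRightSketch is hypothetical:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// trimRightSketch removes trailing characters of s for as long as
// the last character is contained in cutset.
func trimRightSketch(s, cutset string) string {
	for len(s) > 0 {
		r, size := utf8.DecodeLastRuneInString(s)
		if !strings.ContainsRune(cutset, r) {
			break // the last character is not in the set: stop
		}
		s = s[:len(s)-size]
	}
	return s
}

func main() {
	fmt.Println(trimRightSketch("xohelloxo", "xo"))   // xohell
	fmt.Println(strings.TrimRight("xohelloxo", "xo")) // xohell
}
```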
If we just want to remove the last xo, we can use the TrimSuffix function:
    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        s := "xohelloxo"
        s = strings.TrimSuffix(s, "xo")
        fmt.Println(s)
    }

    go run 7.go
    xohello
There is also a corresponding function, TrimPrefix, which removes a prefix from the front of the string.
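The difference matters most when the argument looks like a word, such as a file extension. Here is a short side-by-side sketch; the helper name trimBoth and the sample string hello.go.go are made up for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// trimBoth trims the right side of s in two ways: by character set
// (TrimRight) and by exact suffix (TrimSuffix).
func trimBoth(s, arg string) (byCutset, bySuffix string) {
	return strings.TrimRight(s, arg), strings.TrimSuffix(s, arg)
}

func main() {
	c, x := trimBoth("xohelloxo", "xo")
	fmt.Println(c, x) // xohell xohello

	// With ".go" as a character set, '.', 'g', and 'o' are all stripped,
	// including the final o of hello.
	c, x = trimBoth("hello.go.go", ".go")
	fmt.Println(c, x) // hell hello.go
}
```

TrimSuffix removes the suffix at most once, which is why hello.go.go becomes hello.go rather than hello.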
5. String concatenation
String concatenation is another common operation in development. In Go, there are generally two ways to do it.
Let's first look at concatenation with the + operator:
    package main

    import (
        "fmt"
        "strings"
    )

    func implode(values []string, operate string) string {
        s := ""
        for _, value := range values {
            s += operate
            s += value
        }
        s = strings.TrimPrefix(s, operate)
        return s
    }

    func main() {
        a := []string{"hello", "world"}
        s := implode(a, " ")
        fmt.Println(s)
    }

    go run 7.go
    hello world
The disadvantage of this method is that strings are immutable: each += does not update s in place but allocates a new string and copies the old contents into it, so this method performs poorly when there are many values.
Another way is to use strings.Builder:
    package main

    import (
        "fmt"
        "strings"
    )

    func implode(values []string, operate string) string {
        sb := strings.Builder{}
        for _, value := range values {
            _, _ = sb.WriteString(operate)
            _, _ = sb.WriteString(value)
        }
        s := strings.TrimPrefix(sb.String(), operate)
        return s
    }

    func main() {
        a := []string{"hello", "world"}
        s := implode(a, " ")
        fmt.Println(s)
    }

    go run 7.go
    hello world
First, we create a strings.Builder. On each iteration, we build the resulting string by calling the WriteString method, which appends its argument to the builder's internal buffer, minimizing memory copies.
The second return value of WriteString is an error, but it is always nil. The error is there because strings.Builder implements the io.StringWriter interface, which contains a single method: WriteString(s string) (n int, err error).
Let's see what the inside of WriteString looks like:
    func (b *Builder) WriteString(s string) (int, error) {
        b.copyCheck()
        b.buf = append(b.buf, s...)
        return len(s), nil
    }
We can see that b.buf is a byte slice and that the implementation uses append. If the final slice is large, repeated calls to append keep growing the underlying array, and each growth allocates a new array and copies the existing data, which hurts performance.
The usual solution is to allocate the slice with enough capacity at initialization time, when its final size is known in advance.
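The effect of preallocation on a plain byte slice can be observed by counting how often the capacity changes. This is a rough sketch; the helper name countGrowths is hypothetical, and the number of growths without preallocation depends on the Go version, so no figure is given for it:

```go
package main

import "fmt"

// countGrowths appends n bytes to buf one at a time and reports how many
// times the capacity changed, i.e. how often append had to reallocate.
func countGrowths(buf []byte, n int) int {
	growths := 0
	for i := 0; i < n; i++ {
		before := cap(buf)
		buf = append(buf, 'x')
		if cap(buf) != before {
			growths++
		}
	}
	return growths
}

func main() {
	const n = 1 << 16
	fmt.Println("no preallocation:", countGrowths(nil, n))
	fmt.Println("preallocated:", countGrowths(make([]byte, 0, n), n)) // preallocated: 0
}
```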
This gives us a further optimization for the string concatenation above:
    package main

    import (
        "fmt"
        "strings"
    )

    func implode(values []string, operate string) string {
        total := 0
        for i := 0; i < len(values); i++ {
            total += len(values[i])
        }
        total += len(operate) * len(values)

        sb := strings.Builder{}
        sb.Grow(total) // Reserve capacity in b.buf up front (the length is unchanged)
        for _, value := range values {
            _, _ = sb.WriteString(operate)
            _, _ = sb.WriteString(value)
        }
        s := strings.TrimPrefix(sb.String(), operate)
        return s
    }

    func main() {
        a := []string{"hello", "world"}
        s := implode(a, " ")
        fmt.Println(s)
    }

    go run 7.go
    hello world
6. Byte slice to string
To be clear, converting a byte slice to a string requires a copy. You can verify it with the following code:
    b := []byte{'a', 'b', 'c'}
    s := string(b)
    b[1] = 'x'
    fmt.Println(s)
This prints abc rather than axc, which shows that s holds its own copy of the data. So converting a byte slice to a string has overhead.
But many of the APIs we use in development, such as the Read method of io.Reader, take or return byte slices. If our own functions work with strings, every call to such an API forces a conversion between byte slice and string.
So the conclusion is: when choosing the type of input parameters or return values, prefer byte slices over strings wherever byte slices will do.