python - Radix sort for strings in Python

Compared with Python's sort, my radix sort function returns a list that is ordered, but not in the correct order:

My radix sort: ['aa', 'a', 'ab', 'abs', 'asd', 'avc', 'axy', 'abid']
Python's sort: ['a', 'aa', 'ab', 'abid', 'abs', 'asd', 'avc', 'axy']

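* My radix sort does not pad the strings
* It works from the least significant position (LSB/LSD) first
* I need to take the length of each word into account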

Here is my code.

def count_sort_letters(array, size, col, base):
    output   = [0] * size
    count    = [0] * base
    min_base = ord('a')

    for item in array:
        correct_index = min(len(item) - 1, col)
        letter = ord(item[-(correct_index + 1)]) - min_base
        count[letter] += 1

    for i in range(base - 1):
        count[i + 1] += count[i]

    for i in range(size - 1, -1, -1):
        item = array[i]
        correct_index = min(len(item) - 1, col)
        letter = ord(item[-(correct_index + 1)]) - min_base
        output[count[letter] - 1] = item
        count[letter] -= 1

    return output


def radix_sort_letters(array):
    size = len(array)

    max_col = len(max(array, key = len))

    for col in range(max_col):
        array = count_sort_letters(array, size, col, 26)

    return array

Can anyone find a way to fix this?

Best answer

As I mentioned in the comments:

In your code the lines:

correct_index = min(len(item) - 1, col)
letter = ord(item[-(correct_index + 1)]) - min_base

always use the first letter of the word once col reaches or exceeds the word length. As a result, shorter words keep being sorted on their first letter for every remaining pass. For instance, ['aa', 'a'] remains unchanged because every pass of the for col loop compares 'a' against 'a', which leaves the order as it was.
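
A minimal sketch of that effect. The helper keys_per_pass is hypothetical (it is not part of the question's code); it only reproduces the question's indexing and lists the character each word contributes on every pass:

```python
def keys_per_pass(words):
    """Return, for each pass, the characters the question's indexing picks."""
    max_col = len(max(words, key=len))
    passes = []
    for col in range(max_col):
        keys = []
        for item in words:
            # Same clamping as the question's code: once col runs past the
            # start of a short word, the index sticks at its first letter.
            correct_index = min(len(item) - 1, col)
            keys.append(item[-(correct_index + 1)])
        passes.append(keys)
    return passes


for col, keys in enumerate(keys_per_pass(['aa', 'a'])):
    print(col, keys)
# 0 ['a', 'a']
# 1 ['a', 'a']
```

Every pass sees identical keys, so a stable sort never gets a reason to move 'a' ahead of 'aa'.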

Corrected code

Note: I have tried to keep changes to the original code to a minimum

def count_sort_letters(array, size, col, base, max_len):
    """Helper routine that performs a counting sort on column col."""
    # max_len is unused; it is kept so the call signature stays unchanged.
    output   = [0] * size
    count    = [0] * (base + 1)  # one additional cell for the dummy letter
    min_base = ord('a') - 1      # subtract one to allow for the dummy character

    for item in array:  # generate counts
        # Use the letter in column col if the string is long enough,
        # otherwise use the dummy position 0
        letter = ord(item[col]) - min_base if col < len(item) else 0
        count[letter] += 1

    for i in range(len(count) - 1):  # accumulate counts
        count[i + 1] += count[i]

    for item in reversed(array):  # walk backwards to keep the sort stable
        # Index of the current item's column-col letter in the count array
        letter = ord(item[col]) - min_base if col < len(item) else 0
        output[count[letter] - 1] = item
        count[letter] -= 1

    return output

def radix_sort_letters(array, max_col = None):
    """Main sorting routine."""
    if not array:
        return array  # nothing to sort; max() would fail on an empty list
    if not max_col:
        max_col = len(max(array, key = len))  # length of the longest word

    for col in range(max_col - 1, -1, -1):  # max_col-1, max_col-2, ..., 0
        array = count_sort_letters(array, len(array), col, 26, max_col)

    return array


Test

lst = ['aa', 'a', 'ab', 'abs', 'asd', 'avc', 'axy', 'abid']
print(radix_sort_letters(lst))

# Compare to Python sort
print(radix_sort_letters(lst)==sorted(lst))

Output

['a', 'aa', 'ab', 'abid', 'abs', 'asd', 'avc', 'axy']
True
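
A single test list only goes so far; comparing against sorted() on random inputs gives a bit more confidence. The sketch below assumes lowercase ASCII words. So that it runs on its own, it uses a separate bucket-based variant under the hypothetical name radix_sort_buckets, not the function above; the same loop could be pointed at radix_sort_letters instead.

```python
import random
import string


def radix_sort_buckets(words):
    """LSD radix sort for lowercase ASCII words, using one list per key."""
    if not words:
        return []
    max_len = max(len(w) for w in words)
    for col in range(max_len - 1, -1, -1):
        # Bucket 0 holds words too short to have a character at col;
        # buckets 1..26 hold 'a'..'z'.
        buckets = [[] for _ in range(27)]
        for w in words:
            key = ord(w[col]) - ord('a') + 1 if col < len(w) else 0
            buckets[key].append(w)  # appending keeps each pass stable
        words = [w for bucket in buckets for w in bucket]
    return words


random.seed(0)  # fixed seed so the check is repeatable
for _ in range(200):
    sample = [
        ''.join(random.choices(string.ascii_lowercase, k=random.randint(1, 6)))
        for _ in range(random.randint(0, 30))
    ]
    assert radix_sort_buckets(sample) == sorted(sample)
print("all random cases match sorted()")
```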

Explanation

Counting sort is a stable sort, meaning that items with equal keys keep their original relative order.
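
As a quick illustration of stability (using Python's built-in sorted(), which is also stable, rather than the counting sort itself), sorting on the last character only leaves ties in their input order:

```python
words = ['bd', 'ac', 'ad']

# 'bd' and 'ad' share the key 'd'; a stable sort keeps 'bd' ahead of 'ad'
# because that is the order they had in the input.
by_last_char = sorted(words, key=lambda w: w[-1])
print(by_last_char)  # ['ac', 'bd', 'ad']
```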

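Let's walk through an example to see how the function works.

Let's sort: ['ac', 'xb', 'ab']

We go through the characters of each word in reverse order, last character first.

Iteration 0: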

The key is the last character of each word (i.e. index -1):
the keys are ['c', 'b', 'b'] (the last characters of 'ac', 'xb', and 'ab').

Performing a counting sort on these keys gives ['b', 'b', 'c'].

This places the corresponding words in the order:
['xb', 'ab', 'ac']

Entries 'xb' and 'ab' have equal keys (value 'b'), so they keep their
original order of 'xb' followed by 'ab'
(since counting sort is a stable sort).
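
The three steps of that single pass (count the keys, accumulate the counts, place items from the back) can be sketched as follows. The function name counting_sort_by_last_char is hypothetical, and the sketch assumes non-empty lowercase ASCII words:

```python
def counting_sort_by_last_char(words):
    """Stable counting sort of lowercase words on their last character."""
    count = [0] * 26
    for w in words:                 # 1. count each key
        count[ord(w[-1]) - ord('a')] += 1
    for i in range(1, 26):          # 2. prefix sums: count[k] = number of keys <= k
        count[i] += count[i - 1]
    output = [None] * len(words)
    for w in reversed(words):       # 3. place from the back so ties keep their order
        k = ord(w[-1]) - ord('a')
        count[k] -= 1
        output[count[k]] = w
    return output


print(counting_sort_by_last_char(['ac', 'xb', 'ab']))  # ['xb', 'ab', 'ac']
```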

Iteration 1:

The key is the next-to-last character (i.e. index -2):

The keys are ['x', 'a', 'a'] (corresponding to the list ['xb', 'ab', 'ac']).

Counting sort produces the key order ['a', 'a', 'x'],
which places the corresponding words in the order
['ab', 'ac', 'xb'], and we are done.
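
Both iterations can be replayed in a few lines. This is only a trace, with the stable built-in sorted() standing in for the counting sort of each pass:

```python
words = ['ac', 'xb', 'ab']

# Process columns from the last character to the first.
trace = []
for col in (1, 0):
    words = sorted(words, key=lambda w: w[col])
    trace.append(words)
    print(f"after column {col}: {words}")
# after column 1: ['xb', 'ab', 'ac']
# after column 0: ['ab', 'ac', 'xb']
```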

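Original software bug: your code indexed characters from the end of each word (item[-(col + 1)]), so words of different lengths were lined up on their last characters instead of their first. Columns have to be counted from the start of each word and processed from right to left, because we want the last pass to be based on the first character, the second-to-last pass on the second character, and so on.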
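
To see why the order of the passes matters, here is a small sketch for equal-length words. sort_by_columns is a hypothetical helper, and sorted() again stands in for the counting sort:

```python
def sort_by_columns(words, columns):
    """Apply one stable sort per column, in the order given."""
    for col in columns:
        words = sorted(words, key=lambda w: w[col])
    return words


words = ['ac', 'xb', 'ab']
print(sort_by_columns(words, [1, 0]))  # right to left: ['ab', 'ac', 'xb']
print(sort_by_columns(words, [0, 1]))  # left to right: ['ab', 'xb', 'ac']
```

Whichever column is sorted last dominates the final order, so the first character has to be the last pass.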

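Strings of different lengths

The previous example was simplified by assuming strings of equal length. Now let's try strings of unequal length, such as:

['ac', 'a', 'ab']

This immediately raises a problem: because the words differ in length, we cannot pick a letter from every word on each pass.

We can fix this by padding each word with a dummy character (e.g. '*'):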

['ac', 'a*', 'ab']

Iteration 0: the keys are the last character of each word, so: ['c', '*', 'b']

The dummy character is taken to be less than all other
characters, so the sorted keys are
['*', 'b', 'c'], which places the corresponding words in the order

['a*', 'ab', 'ac']

Iteration 1: the keys are the next-to-last character of each word, so: ['a', 'a', 'a']

Since the keys are all equal, counting sort does not change the order, so we keep

['a*', 'ab', 'ac']

Removing the dummy character from each string (if any), we end up with:

['a', 'ab', 'ac']
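
The padded walkthrough can be written out literally. This is a sketch only: radix_sort_padded is a hypothetical name, the words are assumed to contain lowercase letters and never '*', and sorted() stands in for the per-column counting sort:

```python
def radix_sort_padded(words, pad='*'):
    """LSD radix sort that really pads the words, then strips the padding."""
    if not words:
        return []
    width = max(len(w) for w in words)
    padded = [w.ljust(width, pad) for w in words]
    for col in range(width - 1, -1, -1):
        # '*' (code point 42) compares below 'a'..'z', so padded cells sort first.
        padded = sorted(padded, key=lambda w: w[col])
    return [w.rstrip(pad) for w in padded]


print(radix_sort_padded(['ac', 'a', 'ab']))  # ['a', 'ab', 'ac']
```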

The idea behind the letter lookup in count_sort_letters (ord(item[col]) - min_base if col < len(item) else 0) is to mimic the behavior of padding the strings without actually padding them, since padding is extra work. Based on the column index, it checks whether the index points to the padded or the unpadded portion of the string and returns the appropriate index into the counting array.
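
As a rough illustration, that lookup can be written as a standalone function. The name column_index is hypothetical (the code above inlines the same expression), and lowercase ASCII input is assumed:

```python
def column_index(word, col):
    """Index into a 27-slot count array: 0 for 'padding', 1..26 for 'a'..'z'."""
    if col >= len(word):
        return 0  # col falls in the (virtual) padded part of the word
    return ord(word[col]) - ord('a') + 1


for word in ['ac', 'a', 'ab']:
    print(word, [column_index(word, col) for col in range(2)])
# ac [1, 3]
# a [1, 0]
# ab [1, 2]
```

The short word 'a' gets index 0 in column 1, exactly as if it had been padded to 'a*' with a character that sorts before 'a'.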

A similar question, python - Radix sort for strings in Python, can be found on Stack Overflow: https://stackoverflow.com/questions/60968950/
