Scrapy Item assignment/population: details to watch out for

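1. How it started

After scraping the Weibo hot-search JSON data, I kept getting an error that looked like a database field problem, no matter what I tried. The full traceback was: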

Traceback (most recent call last):
  File "D:\anaconda\envs\DjangoEnv\Lib\site-packages\twisted\internet\defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "D:\anaconda\envs\DjangoEnv\Lib\site-packages\scrapy\utils\defer.py", line 307, in f
    return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
  File "D:\Aproject\django-project\project5-scrapy-tutorial\project_2\tutorial\tutorial\pipelines.py", line 46, in process_item    
    self.db[collection_name].insert_one(dict(hot))
ValueError: dictionary update sequence element #0 has length 1; 2 is required

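For a long time I could not work out the actual cause. My vague guess was that the data format was wrong when inserting into the database, but I read the code over and over and could not find the mistake. I then suspected that the scraped JSON was being parsed incorrectly, so I saved the JSON to a file and kept debugging it in Jupyter, looking for the bracket I must have dropped 😡.
In the end I still could not find it, so I decided to step through the code in a debugger to see where it failed. The Scrapy documentation did not seem to cover debugging pipelines (only spiders). I eventually found a method in an article on Zhihu (reference link), but the author's approach raised an error on my setup (Windows 11 + VS Code). Following a suggestion from the comments instead, I created a run.py file at the same level as scrapy.cfg (the project root), put the code below in it, set breakpoints in the project, and then debugged run.py. That made the breakpoints work.

The code: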

import os
from scrapy.cmdline import execute

# Run from the directory that contains this file (the project root)
os.chdir(os.path.dirname(os.path.realpath(__file__)))
try:
    execute(
        [
            'scrapy',
            'crawl',
            'weibo',  # replace with the name of your spider
            '-o',
            'out.json',
        ]
    )
except SystemExit:
    pass

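2. After debugging

After debugging, I found that the item variable in the process_item() method of my pipeline class in pipelines.py was not, as I had imagined, a list of dicts. It was a dict (more precisely, a dict-like Item) whose key is the one declared in items.py and whose value is the list of dicts I actually wanted, assigned in parse(). With the cause finally identified, the fix was simple.

3. The fix

Just change the original process_item() method: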

import datetime  # at the top of pipelines.py

def process_item(self, item, spider):
    now = datetime.datetime.now()  # use the current time as the collection name
    collection_name = now.strftime('%Y-%m-%d:%H:%M:%S')

    # Wrong original: iterating over the item itself yields its keys
    # for hot in item:  -> changed to the line below
    for hot in item['realtime']:
        self.db[collection_name].insert_one(dict(hot))
    return item
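
The error message makes sense once you remember that iterating over a dict-like object yields its keys. Here is a minimal reproduction. It uses a plain dict as a stand-in for the Scrapy item (an assumption, so that the snippet runs without Scrapy installed), and the sample record is made up:

```python
# A plain dict stands in for the Scrapy item (assumption: it iterates the same way).
item = {"realtime": [{"word_scheme": "#news title#", "emoticon": ""}]}

# Iterating over the item yields its keys, so each `hot` is the string "realtime".
keys = [hot for hot in item]

try:
    # dict("realtime") tries to read every character as a (key, value) pair.
    dict(keys[0])
except ValueError as exc:
    error_message = str(exc)

# Iterating over item["realtime"] yields the dicts that were actually wanted.
records = [dict(hot) for hot in item["realtime"]]

print(keys)
print(error_message)
print(records)
```

The message captured in error_message is the same one as in the traceback: the first "element" of the string is a single character of length 1, where dict() expects a pair.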

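4. Digging into the cause

Official documentation
First, to make scraped web data convenient to handle in Python, the developers designed the Item class as a dict-like structure that copies the API of Python's dict wholesale (copy-paste for the win 😋):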

Item objects replicate the standard dict API, including its __init__ method.

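So in practice, an Item is a dict-like class used to pass data between spiders and pipelines in Scrapy. The flow is:
Step 1: declare the item's keys in items.py;
Step 2: in the spider's parse method, fill in the values with the parsed page data;
Step 3: have the pipeline save the item to a file, a database, and so on.

A detailed example follows:
4.1 Define a class in items.py: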

class Example(scrapy.Item):  # must inherit from scrapy.Item to get the dict-like API
    realtime = scrapy.Field()

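What Field() does here
PS. From the official documentation:
"The Field class is just an alias to the built-in dict class and doesn’t provide any extra functionality or attributes." In other words, Field is really just another name for Python's built-in dict and does nothing else (Field() source walkthrough). The complicated part is how Item itself works (honestly, I did not follow it 😵; Item source walkthrough).
You can also override the serialize_field() method, which belongs to the item exporters rather than to Item, to control how a field's value is serialized on export (see "Overriding the serialize_field() method" in the official documentation).
Declaring an Item feels a lot like declaring a Django Form (the Scrapy docs themselves draw the comparison with Django Models), except that Field() has no notion of field types, so it is far simpler than Django's fields.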
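
To make the "dict with declared keys" idea concrete, here is a simplified sketch of how such a class could be built. This is not Scrapy's implementation (the real Item collects its fields through a metaclass). SketchItem is a hypothetical name, and Field here is a local stand-in rather than scrapy.Field:

```python
from collections.abc import MutableMapping


class Field(dict):
    """Local stand-in for scrapy.Field: a dict that can hold field metadata."""


class SketchItem(MutableMapping):
    """Hypothetical, simplified item: a mapping that only accepts declared keys."""

    def __init__(self, *args, **kwargs):
        # Every class attribute that is a Field becomes an allowed key.
        self.fields = {
            name: value
            for klass in reversed(type(self).__mro__)
            for name, value in vars(klass).items()
            if isinstance(value, Field)
        }
        self._values = {}
        # Accept the same constructor arguments as dict().
        for key, value in dict(*args, **kwargs).items():
            self[key] = value

    def __getitem__(self, key):
        return self._values[key]

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        self._values[key] = value

    def __delitem__(self, key):
        del self._values[key]

    def __iter__(self):
        # Iteration yields keys, exactly like a plain dict.
        return iter(self._values)

    def __len__(self):
        return len(self._values)


class Example(SketchItem):
    realtime = Field()


item = Example()
item["realtime"] = [{"word_scheme": "#news title#"}]
print(list(item))  # the populated keys
print(dict(item))  # the item converted to a plain dict
```

Assigning to a key that was never declared, such as item["other"] = 1, raises KeyError in this sketch. That is the practical difference from a plain dict: the keys are fixed up front in the class body.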

4.2 In the spider, what I scrape is JSON data, for example:

{
    "ok": 1,
    "data": {
        "realtime": [
            {
                "star_name": {},
                "word_scheme": "#news title#",
                "emoticon": ""
            }
        ]
    }
}

4.3 Then I populate the class defined in items.py:

import json

import scrapy


class ExampleSpider(scrapy.Spider):
    ...  # omitted
    def parse(self, response):
        # response.text is the response body as a str; json.loads turns it into a dict
        jsondata = json.loads(response.text)
        realtime = jsondata['data']['realtime']  # extract the list of hot-search entries
        item = Example()  # instantiate the item
        item['realtime'] = realtime
        return item

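The line to pay attention to is item['realtime'] = realtime. Although the realtime list has already been extracted at that point, what gets passed to the pipeline for storage is the item dict: its key is the attribute declared in items.py, and its value is the object assigned in parse. So the item that actually reaches the pipeline looks like this: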

{ 
    'realtime': [{
        "star_name": {},
        "word_scheme": "#news title#",
        "emoticon": "",
    },
    ]
}

4.4 This means that if process_item in your pipeline class in pipelines.py needs the original list, it first has to pull it out of the item dict, like this:

    def process_item(self,item,spider):
            #....
        for hot in item['realtime']:
            #....
        return item
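
Since the bug only showed up inside process_item, it can help to exercise that method on its own, without starting a crawl or a database. The sketch below does that under some assumptions: FakeCollection and HotSearchPipeline are made-up names, the fake database is just a defaultdict, and a plain dict stands in for the Scrapy item.

```python
import datetime
from collections import defaultdict


class FakeCollection:
    """Hypothetical in-memory stand-in for a MongoDB collection."""

    def __init__(self):
        self.documents = []

    def insert_one(self, document):
        self.documents.append(document)


class HotSearchPipeline:
    """Sketch of the pipeline; the class name is invented for this example."""

    def __init__(self, db):
        # Anything that supports db[collection_name].insert_one(...) works here.
        self.db = db

    def process_item(self, item, spider):
        # One collection per call, named after the current time.
        collection_name = datetime.datetime.now().strftime("%Y-%m-%d:%H:%M:%S")
        for hot in item["realtime"]:
            self.db[collection_name].insert_one(dict(hot))
        return item


fake_db = defaultdict(FakeCollection)  # creates a collection on first access
pipeline = HotSearchPipeline(fake_db)

item = {"realtime": [{"word_scheme": "#a#"}, {"word_scheme": "#b#"}]}
returned = pipeline.process_item(item, spider=None)

stored = [doc for coll in fake_db.values() for doc in coll.documents]
print(stored)
```

Changing the loop back to `for hot in item:` in this sketch reproduces the ValueError right away, which is a quicker check than stepping through a full crawl.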

cyanine